CN115546442A - Multi-view stereo matching reconstruction method and system based on perception consistency loss - Google Patents
Multi-view stereo matching reconstruction method and system based on perception consistency loss
- Publication number
- CN115546442A (application CN202211390106.8A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- layer
- loss
- depth
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T17/20 — Three-dimensional [3D] modelling; finite element generation, e.g. wire-frame surface description, tessellation
- G06N3/08 — Neural networks; learning methods
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; image merging
Abstract
A multi-view stereo matching reconstruction method and system based on perceptual consistency loss, belonging to the field of three-dimensional reconstruction and aiming to solve the problems of heavy memory usage and low completeness in the prior art. The method comprises the following steps. Step 1, construct the network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a double correction layer module. Step 2, prepare the data set: a source image and reference images are used as data input, and a dense depth map D of the three-dimensional structure is estimated from the reference view using the original 3D points. Step 3, train the network model: the data set prepared in step 2 is input into the network model constructed in step 1 for training. Step 4, design the loss function and perform regression. Step 5, save the model and test it. Step 6, point cloud fusion and reconstruction: the three-dimensional point clouds obtained in step 5 are fused, the fused matching reconstruction result is inspected with MeshLab software, and suitable evaluation indexes are selected to measure the accuracy of the algorithm and the performance of the system.
Description
Technical Field
The invention relates to a multi-view stereo matching reconstruction method and system based on perception consistency loss, and belongs to the technical field of three-dimensional reconstruction.
Background
Three-dimensional reconstruction is widely applied in industrial measurement, intelligent robotics, unmanned systems, medical diagnosis, digital city modeling, somatosensory entertainment and other areas. Multi-view stereo matching is an end-to-end learning approach within three-dimensional reconstruction; it belongs to passive three-dimensional reconstruction and is characterized by low cost, simple structure and good practicality. Since every stereo matching method has its own limitations, both conventional and learning-based methods place high demands on matching completeness and on keeping the network lightweight. Optimization of the matched depth map is therefore essential for the quality of subsequent point cloud fusion. To obtain effective depth information for better dense reconstruction, most existing multi-view stereo matching reconstruction methods operate at the pixel level. However, such methods face two key problems: long running time prevents a truly lightweight implementation, and it is difficult to achieve high completeness while maintaining accuracy.
Chinese patent publication No. CN113963117A, entitled "a method and apparatus for multi-view three-dimensional reconstruction based on variable convolution depth network", describes a method that inputs a source image and reference images from multiple viewing angles; extracts features from the input images through a multi-scale feature network built from deformable convolutions; performs iterative optimization of pixel depth matching and edge processing with a learning-based patch matching iterative model to obtain an iteratively optimized depth map; and finally feeds the iteratively optimized depth map and the source image into a depth residual network for refinement to obtain the final depth map, from which three-dimensional reconstruction yields a stereoscopic vision map. The method adopts supervised learning, occupies much memory and achieves low completeness.
Disclosure of Invention
The invention provides a multi-view stereo matching reconstruction method based on perceptual consistency loss, which aims to solve the problems of heavy memory usage and low completeness in existing multi-view stereo matching methods. The resulting depth map is smoother and more complete, improves subsequent point cloud fusion, and better matches detailed observation of objects.
The technical scheme for solving the technical problem is as follows:
the multi-view stereo matching reconstruction method based on the perception consistency loss comprises the following steps:
step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a double correction layer module, and each module is composed of convolution layers, a regularization function and an activation function; the pyramid cost volume regularization module applies downsampling extraction and upsampling integration regularization to the extracted features to construct a 3D cost volume and obtain a dense depth map; finally, the double correction layer module applies simple filtering to the obtained depth map to remove redundant information, optimizing the depth map combination and retaining the useful information;
step 2, preparing a data set: using a source image and a reference image as data input, and estimating a dense depth map D of a three-dimensional structure from a reference view by using original 3D points;
step 3, training a network model: inputting the data set prepared in the step 2 into the network model constructed in the step 1 for training;
step 4, designing the loss function and performing regression: depth information obtained from the data-augmentation consistency loss and depth information obtained from the depth perception loss are compared by disparity to obtain effective ground-truth depth values, which are passed to the depth regression loss as supervision; when the number of training iterations reaches a set threshold or the value of the loss function falls within a set range, the model parameters are considered trained and are saved;
step 5, saving the model and testing: the finally determined model parameters are frozen, so that when point cloud fusion and reconstruction are required, the image can be input directly into the network to obtain the final three-dimensional point cloud;
step 6, point cloud fusion and reconstruction: the three-dimensional point cloud obtained in step 5 is fused, and the fused matching reconstruction result is inspected with MeshLab software; to further verify the quality of the model, the most suitable evaluation indexes are selected to measure the accuracy of the algorithm and the performance of the system.
The feature extraction module in step 1 comprises eight convolution layers. Convolution layers one, two and three perform the first downsampling, halving the size of the feature map; convolution layers four, five and six perform the second downsampling, halving the feature map again; the feature map is finally output by convolution layers seven and eight. The output of every convolution layer in the feature extraction module is regularized. The pyramid cost volume regularization module comprises ten convolution layers: three-dimensional convolution layers one and two perform the first downsampling of the extracted features, three-dimensional convolution layers three and four perform the second downsampling, and three-dimensional convolution layers five, six and seven output the convolved feature information. Three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function; three-dimensional deconvolution layers two and three likewise upsample the outputs of their corresponding three-dimensional convolution layers and pass the results through activation functions; finally the feature information is integrated to construct the 3D cost volume and obtain a dense depth map.
The output of every convolution layer in the pyramid cost volume regularization module is regularized. The double correction layer module comprises four convolution layers: convolution layers one, two and three apply simple filtering to the obtained depth map to remove redundant information, and convolution layer four upsamples the input and converts its channels to optimize the depth map combination and retain the useful information. The output of every convolution layer in the double correction layer module is regularized. The sizes of the two-dimensional convolution kernels in all convolution layers are uniformly n × n, and the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
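As a rough check on the spatial sizes implied by the two halving stages described above (the input resolution is taken from the DTU images used later in the embodiment; the helper itself is only illustrative):

```python
def downsampled_size(height, width, halvings):
    """Halve the spatial size of a feature map `halvings` times,
    as done by the two downsampling stages of the feature extraction module."""
    for _ in range(halvings):
        height, width = height // 2, width // 2
    return height, width

# DTU images are 1600 x 1200; after the two halving stages the
# feature map is 1/4 of the original resolution in each dimension.
print(downsampled_size(1200, 1600, 2))  # (300, 400)
```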
In step 3, the DTU data set is used during training. Semi-supervised training is performed by estimating the pose of the reconstructed object from multiple frames through the intrinsic and extrinsic camera parameters in the data set. Multi-frame pictures, calibrated camera parameters and depth information serve as the input of the whole network, and ground-truth information matched by the loss calculation serves as the label, which addresses the problem that most work in the field of three-dimensional reconstruction supports only supervised training.
In step 4, the loss function during training is chosen as the combination of the data-augmentation consistency loss and the depth perception loss. The resulting depth image preserves smoothness and consistency while bringing out the true details of the object surface, thereby improving the matching reconstruction quality of the fused point cloud.
After the model is saved in step 5, the ETH3D and Tanks & Temples data sets can be used for testing to evaluate the generalization ability of the model.
In step 6, accuracy and completeness are selected as evaluation indexes, which effectively evaluate the efficiency of the algorithm and measure the behavior of the matching network.
The invention also provides a multi-view stereo matching reconstruction system based on perceptual consistency loss, which comprises:
an image acquisition unit for acquiring a source image and reference images from multiple viewing angles, where the reference images can be processed to remove part of the ground-truth information for semi-supervised training;
an image training unit for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multiple frames through the intrinsic and extrinsic camera parameters in the data set, performing semi-supervised training, and feeding the initially computed depth map into the correction layer designed in the method to filter out redundant depth information and obtain a smoother depth map;
a depth map optimization unit for taking the depth map output by training, computing the matched ground-truth information as a label through the loss function designed in the method to participate in training, and thereby obtaining a dense depth map;
and a point cloud fusion and reconstruction unit for fusing the dense depth map into a three-dimensional point cloud and performing three-dimensional modeling from the fused point cloud to obtain a stereoscopic vision map.
The invention has the following beneficial effects:
1. Using the depth image obtained from the loss calculation as supervision to achieve semi-supervision reduces memory usage and computation, and regresses the three-dimensional points back-projected from the corresponding pixels toward their actual positions in the three-dimensional world, enhancing the robustness of the network. The computed smooth depth map also makes the point cloud matching reconstruction more complete and effective.
2. The double correction layers in the backbone network perform edge checking, make maximal use of the depth information values, and effectively resolve errors in edge occlusion information.
3. The whole training network uses a concatenation operation on the two branches to mix the effective depth information of the correction layer and the loss layer, giving the network a stronger ability to compute losses for images at two different depths; with depth regression added to the loss function, the network has few parameters, a simple overall structure and high reconstruction accuracy.
Drawings
Fig. 1 is a flowchart of the multi-view stereo matching reconstruction method based on perceptual consistency loss.
Fig. 2 is a network structure diagram of a multi-view stereo matching reconstruction method based on perceptual consistency loss.
Fig. 3 shows the specific composition of the feature extraction module according to the present invention.
Fig. 4 shows the specific composition of the pyramid cost volume regularization module of the present invention.
Fig. 5 shows the specific composition of the double correction layer module according to the present invention.
Fig. 6 is a schematic structural diagram of the multi-view stereo matching reconstruction system based on perceptual consistency loss according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the multi-view stereo matching reconstruction method based on perceptual consistency loss specifically includes the following steps:
Step 1, constructing a network model. As shown in fig. 2, the entire network includes a feature extraction module, a pyramid cost volume regularization module and a double correction layer module, and each module is composed of convolution layers, regularization and activation functions. As shown in fig. 3, in the feature extraction module, convolution layers one, two and three perform the first downsampling, halving the feature map; convolution layers four, five and six perform the second downsampling, halving the feature map again; the feature map is finally output by convolution layers seven and eight. The output of every convolution layer in the feature extraction module is regularized. As shown in fig. 4, in the pyramid cost volume regularization module, three-dimensional convolution layers one and two perform the first downsampling of the extracted features, three-dimensional convolution layers three and four perform the second downsampling, and three-dimensional convolution layers five, six and seven output the convolved feature information.
Three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function; three-dimensional deconvolution layers two and three likewise upsample the outputs of their corresponding three-dimensional convolution layers and pass the results through activation functions; finally the feature information is integrated to construct the 3D cost volume and obtain a dense depth map. The output of every convolution layer in the pyramid cost volume regularization module is regularized. As shown in fig. 5, in the double correction layer module, convolution layers one, two and three apply simple filtering to the obtained depth map to remove redundant information, and convolution layer four upsamples the input and converts its channels to optimize the depth map combination and retain the useful information. The output of every convolution layer in the double correction layer module is regularized. The sizes of the two-dimensional convolution kernels in all convolution layers are uniformly n × n, and the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
Step 2, preparing a data set. With the source image and reference images as data input, the original 3D points are used from the reference view to estimate a dense depth map D of the three-dimensional structure, and the data set is used for semi-supervised training.
Step 3, training the network model. The data set prepared in step 2 is input into the network model constructed in step 1 for training. The DTU data set is used during training. Semi-supervised training is performed by estimating the pose of the reconstructed object from multiple frames through the intrinsic and extrinsic camera parameters in the data set. Multi-frame pictures, calibrated camera parameters and depth information serve as the input of the whole network, and ground-truth information matched by the loss calculation serves as the label, which addresses the problem that most work in the field of three-dimensional reconstruction supports only supervised training.
Step 4, designing the loss function and performing regression. With the loss designed in fig. 2, depth information obtained from the data-augmentation consistency loss and depth information obtained from the depth perception loss are compared by disparity to obtain effective ground-truth depth values, which are passed as supervision to the depth regression loss; when the number of training iterations reaches a set threshold or the value of the loss function falls within a set range, the model parameters are considered trained and are saved. The resulting depth image preserves smoothness and consistency while bringing out the true details of the object surface, improving the matching reconstruction quality of the fused point cloud.
Step 5, saving the model and testing. The finally determined model parameters are frozen, and when point cloud fusion and reconstruction are required the image is input directly into the network to obtain the final three-dimensional point cloud. The model is tested on the ETH3D and Tanks & Temples data sets to evaluate its generalization ability.
Step 6, point cloud fusion and reconstruction. The three-dimensional point cloud obtained in step 5 is fused, and the fused matching reconstruction result is inspected with MeshLab software. Meanwhile, selecting accuracy and completeness as evaluation indexes effectively evaluates the efficiency of the algorithm and measures the behavior of the matching network.
The embodiment is as follows:
As shown in fig. 1, the multi-view stereo matching reconstruction method based on perceptual consistency loss specifically includes the following steps:
Step 1, constructing a network model. As shown in fig. 2, the entire network includes a feature extraction module, a pyramid cost volume regularization module and a double correction layer module, and each module is composed of convolution layers, regularization and activation functions. As shown in fig. 3, the feature extraction module includes eight two-dimensional convolution layers; the convolution kernel size of convolution layers three and six is 5 × 5, with stride and padding both 2; the convolution kernels of convolution layers one, two, four, five, seven and eight are 3 × 3, with stride and padding both 1. As shown in fig. 4, the pyramid cost volume regularization module contains ten three-dimensional convolution layers; the convolution kernels of all layers are 3 × 3 × 3, and the strides and paddings of three-dimensional convolution layers one, three, five and seven are 2 and 1 respectively. The strides and paddings of three-dimensional deconvolution layers one, two and three are all 1. As shown in fig. 5, correction layer one and correction layer two in the double correction layer module have the same structure and include four two-dimensional convolution layers; the convolution kernels of convolution layers one to four are all 3 × 3, stride and padding are all 1, and the concatenation dimension is 1. The activation function is the linear rectification unit (ReLU), and the regularized training objective is defined as follows:
J̃(ω; X, y) = J(ω; X, y) + αΩ(ω)
where X and y are the training samples and corresponding labels, and ω is the weight coefficient vector; J(·) is the objective function and Ω(ω) is the penalty term; the parameter α controls the regularization strength, thereby controlling the complexity of the model and reducing overfitting.
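As a minimal numeric sketch of this regularized objective (the squared-error objective J, the L2 penalty Ω(ω) = ‖ω‖², and the toy data are illustrative assumptions, not the patent's actual choices):

```python
def regularized_objective(weights, data, labels, alpha):
    """J~(w; X, y) = J(w; X, y) + alpha * Omega(w), with a mean-squared-error
    objective J for a linear predictor and an L2 penalty Omega."""
    # Base objective: mean squared error of a linear predictor.
    j = sum((sum(w * x for w, x in zip(weights, xs)) - y) ** 2
            for xs, y in zip(data, labels)) / len(labels)
    # Penalty term: squared L2 norm of the weight vector.
    omega = sum(w * w for w in weights)
    return j + alpha * omega

# The weights fit the toy data exactly, so only the penalty term remains.
print(regularized_objective([1.0, -2.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, -2.0], 0.1))  # 0.5
```

A larger α penalizes large weights more strongly, trading fit against model complexity.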
Step 2, preparing a data set. The DTU data set is used. It consists of 124 different objects or scenes; each object is captured from 49 views, and each view is taken under 7 different brightness levels, so each object or scene folder contains 343 pictures. The data set also provides the training image set with ground-truth depth maps. The resolution of each image is 1600 × 1200.
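The per-scene image count follows directly from the capture setup described above:

```python
views_per_scene = 49       # camera positions per object or scene
brightness_levels = 7      # lighting conditions per view
scenes = 124               # objects/scenes in the DTU data set

images_per_scene = views_per_scene * brightness_levels
print(images_per_scene)            # 343
print(scenes * images_per_scene)   # 42532 images across the whole data set
```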
Step 3, training the network model. The network performs semi-supervised training by estimating the pose of the reconstructed object from multiple frames. Semi-supervised training uses multi-frame pictures, calibrated camera parameters and a small amount of depth information as the input of the whole network, and ground-truth information matched by the loss calculation as the label, which addresses the problem that most work in the field of three-dimensional reconstruction supports only supervised training, while occupying little memory and achieving high completeness.
Step 4, designing the loss function and performing regression. With the loss designed in fig. 2, depth information obtained from the data-augmentation consistency loss and depth information obtained from the depth perception loss are compared by disparity to obtain effective ground-truth depth values, which are passed as supervision to the depth regression loss to optimize the depth map and achieve a better matching reconstruction. The data-augmentation consistency loss is defined as follows:
L_DA = (1 / ‖M_τθ‖) · ‖ M_τθ ⊙ (D − D̂_τθ) ‖
where M_τθ denotes the unoccluded mask under the augmentation τθ, D denotes the prediction of the regular forward pass on the original image, and D̂_τθ denotes the prediction on the augmented image; minimizing the masked difference between D and D̂_τθ ensures data-augmentation consistency.
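A minimal numeric sketch of this masked consistency term (the absolute-difference form and the toy arrays are illustrative assumptions):

```python
import numpy as np

def data_aug_consistency_loss(depth, depth_aug, mask):
    """Mean absolute difference between the depth prediction on the original
    image and on the augmented image, restricted to the unoccluded mask."""
    valid = mask.astype(bool)
    if not valid.any():
        return 0.0  # no unoccluded pixels: nothing to compare
    return float(np.abs(depth[valid] - depth_aug[valid]).mean())

depth = np.array([[1.0, 2.0], [3.0, 4.0]])      # prediction on the original image
depth_aug = np.array([[1.5, 2.0], [3.0, 4.0]])  # prediction on the augmented image
mask = np.array([[1, 1], [0, 0]])               # only the top row is unoccluded
print(data_aug_consistency_loss(depth, depth_aug, mask))  # 0.25
```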
The depth perception loss is defined as follows:
p′ = K′(R′R⁻¹(D(p)K⁻¹p − t) + t′)
P = D(p)K⁻¹p − t
P′ = D(p′)K′⁻¹p′ − t′
Here p′ denotes the estimated pixel corresponding to p, P and P′ are the back-projected three-dimensional depth perception points of p and p′, D(p) and C(p) denote the predicted depth and probability value of pixel p, ε_h denotes the threshold for highly reliable depth prediction, and ε_ω denotes the threshold for filtering out mismatched pixels. The depth perception predictions obtained with the two thresholds ε_h and ε_ω can approximately detect edges, occlusions and erroneous non-occluded regions, which would otherwise form incorrect correspondences.
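A small numeric sketch of the reprojection p′ = K′(R′R⁻¹(D(p)K⁻¹p − t) + t′) above (the homogeneous pixel convention, identity intrinsics/rotations and the chosen translation are assumptions picked so the geometry is easy to follow by hand):

```python
import numpy as np

def reproject(p, depth, K, R, t, K2, R2, t2):
    """p' = K'(R' R^-1 (D(p) K^-1 p - t) + t'), followed by
    dehomogenization by the third coordinate."""
    p_h = np.array([p[0], p[1], 1.0])                                  # homogeneous pixel
    X = R2 @ np.linalg.inv(R) @ (depth * np.linalg.inv(K) @ p_h - t) + t2
    q = K2 @ X
    return q[:2] / q[2]

# Identity intrinsics and rotations; the source camera is shifted along x.
I3 = np.eye(3)
p2 = reproject((2.0, 1.0), 4.0, I3, I3, np.zeros(3), I3, I3, np.array([1.0, 0.0, 0.0]))
print(p2)  # [2.25 1.  ]
```

With everything else at identity, the pixel simply shifts by t′ₓ/depth in x, which matches the printed result.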
The depth regression loss is defined as follows:
L_REG = λ1 · L_SSIM + λ2 · L_Smooth
where L_SSIM and L_Smooth are the structural similarity loss and the depth smoothing loss, both common auxiliary losses for depth estimation. L_SSIM is based on the structural similarity index
SSIM(x, y) = ((2 μ_x μ_y + c1)(2 σ_xy + c2)) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))
where x and y denote pixel windows of size N × N in the two images; μ_x and μ_y are the means of x and y, serving as luminance estimates; σ_x² and σ_y² are the variances of x and y, serving as contrast estimates; σ_xy is the covariance of x and y, serving as the structural similarity measure; and c1 and c2 are small constants that avoid a zero denominator, typically derived from 0.01 and 0.03 respectively. L_Smooth penalizes the gradients of the predicted depth map as an edge-aware smoothness term.
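A minimal per-window SSIM computation following the standard definition with the window statistics described above (the constants c1 = 0.01² and c2 = 0.03², for images scaled to [0, 1], are the usual convention and an assumption here):

```python
import numpy as np

def ssim_window(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural similarity of two same-size pixel windows with values in [0, 1]."""
    mu_x, mu_y = x.mean(), y.mean()            # luminance estimates
    var_x, var_y = x.var(), y.var()            # contrast estimates
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # structural term
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

w = np.array([[0.2, 0.4], [0.6, 0.8]])
print(ssim_window(w, w))  # identical windows give SSIM = 1.0
```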
From the above four loss terms, the overall weighted loss function can be constructed as follows:
L = L_REG + λ3 · L_DA + λ4 · L_DPR = λ1 · L_SSIM + λ2 · L_Smooth + λ3 · L_DA + λ4 · L_DPR
Through extensive comparison and repeated tests, the weights can be assigned as follows: λ1 = 0.2, λ2 = 0.0067, λ3 = 0.1, λ4 = 0.8.
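With the weights above, the combined loss is a plain weighted sum; a trivial sketch (the individual per-term loss values are placeholders):

```python
# Weights lambda1..lambda4 as given in the text.
WEIGHTS = {"ssim": 0.2, "smooth": 0.0067, "da": 0.1, "dpr": 0.8}

def total_loss(l_ssim, l_smooth, l_da, l_dpr, w=WEIGHTS):
    """L = lambda1*L_SSIM + lambda2*L_Smooth + lambda3*L_DA + lambda4*L_DPR."""
    return (w["ssim"] * l_ssim + w["smooth"] * l_smooth
            + w["da"] * l_da + w["dpr"] * l_dpr)

# Placeholder per-term values of 1.0, just to show the weighting:
print(total_loss(1.0, 1.0, 1.0, 1.0))  # ~1.1067
```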
The training times are set to be 16, the upper limit of the number of the pictures input to the network each time is mainly determined according to the performance of a computer graphic processor, and generally, the number of the pictures input to the network each time is within a range of 1-4, so that the network training can be more stable, the training result is better, and the rapid fitting of the network is ensured. In the training process, the learning rate of the parameters is 0.001, so that the fast fitting of the network can be ensured, and the overfitting of the network cannot be caused. The algorithm of the parameter optimizer selects an adaptive matrix estimation algorithm, and the method has the advantages that after offset correction, the learning rate of each iteration has a certain range, so that the parameters are relatively stable. The threshold of the penalty function is set to about 0.0003, and less than 0.0003 can be considered that the training of the entire network is substantially complete.
And 5, saving the model and testing. Training is performed on the DTU dataset and the trained model is saved. The model is then tested on the ETH3D and Tanks & Temples datasets to evaluate its generalization ability.
And 6, point cloud fusion and reconstruction. Subsequent texture enhancement and mesh mapping belong to traditional algorithms; the point cloud obtained after depth map fusion can be used directly for matching, and the reconstruction effect can be inspected directly with MeshLab software to verify the quality of the matching network. Meanwhile, as shown in table 1, selecting accuracy and completeness as the evaluation indices effectively evaluates the efficiency of the algorithm and measures the effect of the matching network. The formula for accuracy is as follows:

accuracy = (1/|R|) · Σ_{r∈R} [ min_{g∈G} ‖r − g‖ < d ]
wherein G is the ground-truth model, g is a point in the ground-truth model, R is the reconstructed model, and r is a point in the reconstructed model. [·] is the Iverson bracket, which equals 1 if the condition inside it is satisfied and 0 otherwise. d is a distance threshold.
The formula for completeness is as follows:

completeness = (1/|G|) · Σ_{g∈G} [ min_{r∈R} ‖g − r‖ < d ]
wherein the symbols have the same meanings as in the accuracy formula: G is the ground-truth model, g a point in it, R the reconstructed model, r a point in it, [·] the Iverson bracket, and d a distance threshold.
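The two indices can be sketched with NumPy as below; the brute-force nearest-neighbour search and the function names are illustrative assumptions (point clouds are given as N×3 arrays, and the comparison against d plays the role of the Iverson bracket).

```python
import numpy as np

def accuracy(R, G, d):
    """Fraction of reconstructed points r in R whose nearest
    ground-truth point g in G lies within distance d."""
    dists = np.linalg.norm(R[:, None, :] - G[None, :, :], axis=-1)
    return (dists.min(axis=1) < d).mean()

def completeness(R, G, d):
    """Fraction of ground-truth points g in G whose nearest
    reconstructed point r in R lies within distance d."""
    dists = np.linalg.norm(G[:, None, :] - R[None, :, :], axis=-1)
    return (dists.min(axis=1) < d).mean()
```

Accuracy penalizes spurious reconstructed points far from the ground truth, while completeness penalizes ground-truth regions the reconstruction fails to cover; reporting both prevents either failure mode from being hidden.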
As shown in fig. 6, the multi-view stereo matching reconstruction system based on perception consistency loss provided by the present invention includes:
the image acquisition unit, used for acquiring a source image and reference images from a plurality of viewing angles, wherein the reference images can be processed to remove part of the ground-truth surface information for semi-supervised training;
the image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multiple frames through the camera's intrinsic and extrinsic parameters in the dataset, and performing semi-supervised training; the initially computed depth map is then input into the correction layer designed in the present invention to filter out redundant depth information and obtain a smoother depth map;
the depth map optimization unit, which applies the loss function designed in the present invention to the depth map output by training, computing the matched ground-truth surface information as a label that participates in training, thereby obtaining a dense depth map;
and the point cloud fusion and reconstruction unit is used for fusing the dense depth map into three-dimensional point cloud and carrying out three-dimensional modeling according to the fused point cloud so as to obtain a stereoscopic vision map.
The implementations of convolution, activation functions, concatenation, regularization, and Gaussian filtering are algorithms well known to those skilled in the art; their specific procedures and methods can be found in the corresponding textbooks or technical literature.
According to the invention, by constructing a multi-view stereo matching network based on perception consistency loss, the depth map can be obtained directly from the source and reference images and fused into a point cloud without other intermediate steps, avoiding the hand-designed stereo matching algorithms of traditional methods. The feasibility and superiority of the method are further verified by computing the relevant indices of the depth maps obtained by the prior art. The relevant indices of the prior-art methods and of the method proposed by the present invention are shown in table 1:
TABLE 1. Comparison of relevant indices between the prior art and the method proposed by the present invention
It is worth noting that the learning method of this method and system is an end-to-end learning method in the semi-supervised field, and the designed loss function is an important enabling component of the invention.
Claims (7)
1. A multi-view stereo matching reconstruction method based on perception consistency loss is characterized by comprising the following steps:
step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid volume regularization module, and a double correction layer module, each of which consists of convolutional layers, regularization functions, and activation functions; the pyramid volume regularization module performs down-sampling extraction and up-sampling integration regularization on the extracted features to construct a 3D cost volume and obtain a dense depth map; finally, the double correction layer module lightly filters the obtained depth map to remove redundant information, optimizing the depth map combination while retaining useful information;
step 2, preparing a data set: using a source image and a reference image as data input, and estimating a dense depth map D of a three-dimensional structure from a reference view by using original 3D points;
step 3, training a network model: inputting the data set prepared in the step 2 into the network model constructed in the step 1 for training;
step 4, designing the loss function and performing regression: the depth information obtained from the data-enhancement consistency loss is compared, via disparity, with the depth information obtained from the depth perception loss to obtain effective surface depth values, which are passed to the depth regression loss as supervision, until the number of training iterations reaches the set threshold or the value of the loss function falls within the set range, at which point the model parameters are considered trained and are saved;
and 5, saving the model and testing: the finally determined model parameters are frozen, and when point cloud fusion and reconstruction are required, images are input directly into the network to obtain the final three-dimensional point cloud;
step 6, point cloud fusion and reconstruction: fusing the three-dimensional point cloud obtained in the step 5, and checking the fused matching reconstruction effect by using MeshLab software; meanwhile, in order to further verify the quality of the model, the optimal evaluation index is selected to measure the accuracy of the algorithm and the performance of the system.
2. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein the feature extraction module in step 1 comprises eight convolutional layers: convolutional layers one to three perform downsampling, halving the feature map size; convolutional layers four to six downsample the feature map, halving its size once more; and convolutional layers seven and eight output the final feature map; the output of every convolutional layer of the feature extraction module undergoes a regularization operation; the pyramid volume regularization module comprises ten convolutional layers: three-dimensional convolutional layers one and two perform the first downsampling of the extracted features, three-dimensional convolutional layers three and four perform the second downsampling, and three-dimensional convolutional layers five, six, and seven output the convolved feature information; three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function, three-dimensional deconvolution layer two upsamples the output of three-dimensional deconvolution layer one and outputs the deconvolved feature information through an activation function, and three-dimensional deconvolution layer three upsamples the output of three-dimensional convolutional layer four and outputs the deconvolved feature information through an activation function; finally, the feature information is integrated to construct a 3D cost volume and obtain a dense depth map; the output of every convolutional layer of the pyramid volume regularization module is regularized; the double correction layer module comprises four convolutional layers: convolutional layers one to three lightly filter the obtained depth map to remove redundant information, and convolutional layer four performs upsampling and channel conversion on the input, optimizing the depth map combination while retaining useful information; the output of every convolutional layer in the double correction layer module undergoes a regularization operation; the sizes of the two-dimensional convolution kernels in all convolutional layers are uniformly n×n, and the sizes of the three-dimensional convolution kernels are uniformly n×n×n.
3. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein the dataset used in the training process of step 3 is DTU; the pose of the reconstructed object is estimated from multiple frames, via the camera's intrinsic and extrinsic parameters in the dataset, to carry out semi-supervised training; multi-frame images, calibrated camera parameters, and depth information serve as the input of the whole network, and the ground-truth surface information computed by the matching loss serves as the label, addressing the limitation that most three-dimensional reconstruction methods can only be trained with full supervision.
4. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein the loss function in step 4 selects a combination of the data-enhancement consistency loss and the depth perception loss during training; the resulting depth map thus maintains both smoothness and consistency while highlighting the fine details of the true surface, thereby improving the matching reconstruction effect of the fused point cloud.
5. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein in step 5, after the model is saved, the ETH3D and Tanks & Temples datasets are used for testing to evaluate its generalization ability.
6. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein selecting accuracy and completeness as the evaluation indices in step 6 effectively evaluates the efficiency of the algorithm and measures the effect of the matching network.
7. A multi-view stereo matching reconstruction system based on perception consistency loss, characterized by comprising:
the image acquisition unit, used for acquiring a source image and reference images from a plurality of viewing angles, wherein the reference images can be processed to remove part of the ground-truth surface information for semi-supervised training;
the image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multiple frames through the camera's intrinsic and extrinsic parameters in the dataset, and performing semi-supervised training; the initially computed depth map is then input into the correction layer designed in the present invention to filter out redundant depth information and obtain a smoother depth map;
the depth map optimization unit, which applies the loss function designed in the present invention to the depth map output by training, computing the matched ground-truth surface information as a label that participates in training, thereby obtaining a dense depth map;
and the point cloud fusion and reconstruction unit is used for fusing the dense depth map into three-dimensional point cloud and carrying out three-dimensional modeling according to the fused point cloud so as to obtain a stereoscopic vision map.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211390106.8A | 2022-11-08 | 2022-11-08 | Multi-view stereo matching reconstruction method and system based on perception consistency loss |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115546442A | 2022-12-30 |
Family ID: 84720673
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117333758A | 2023-12-01 | 2024-01-02 | 博创联动科技股份有限公司 | Land route identification system based on big data analysis |
| CN117437363A | 2023-12-20 | 2024-01-23 | 安徽大学 | Large-scale multi-view stereoscopic method based on depth perception iterator |
| CN117818712B | 2024-01-17 | 2024-05-31 | 广州中铁信息工程有限公司 | Visual shunting intelligent management system based on railway station 5G ad hoc network |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |