CN115546442A - Multi-view stereo matching reconstruction method and system based on perception consistency loss

Multi-view stereo matching reconstruction method and system based on perception consistency loss

Info

Publication number
CN115546442A
Authority
CN
China
Prior art keywords
dimensional
layer
loss
depth
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211390106.8A
Other languages
Chinese (zh)
Inventor
詹伟达
曹可亮
郝子强
蒋一纯
郭金鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202211390106.8A
Publication of CN115546442A
Legal status: Pending

Classifications

    • G06T17/20 — Three-dimensional [3D] modelling; finite element generation, e.g. wire-frame surface description, tessellation
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T2207/10028 — Range image; depth image; 3D point clouds
    • G06T2207/20081 — Training; learning
    • G06T2207/20084 — Artificial neural networks [ANN]
    • G06T2207/20221 — Image fusion; image merging


Abstract

A multi-view stereo matching reconstruction method and system based on perceptual consistency loss belong to the field of three-dimensional reconstruction and aim to solve the problems of heavy memory occupation and low completeness in the prior art. The method comprises the following steps. Step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a dual correction layer module. Step 2, preparing a data set: using a source image and reference images as data input, and estimating a dense depth map D of the three-dimensional structure from the reference view using the original 3D points. Step 3, training the network model: inputting the data set prepared in step 2 into the network model constructed in step 1 for training. Step 4, designing a loss function and performing regression. Step 5, saving the model and testing. Step 6, point cloud fusion and reconstruction: fusing the three-dimensional point clouds obtained in step 5, checking the fused matching reconstruction effect with MeshLab software, and selecting suitable evaluation indexes to measure the accuracy of the algorithm and the performance of the system.

Description

Multi-view stereo matching reconstruction method and system based on perception consistency loss
Technical Field
The invention relates to a multi-view stereo matching reconstruction method and system based on perceptual consistency loss, and belongs to the technical field of three-dimensional reconstruction.
Background
Three-dimensional reconstruction is widely applied in industrial measurement, intelligent robotics, unmanned systems, medical diagnosis, digital city modeling, somatosensory entertainment, and other areas. Multi-view stereo matching is an end-to-end learning method within three-dimensional reconstruction technology; it belongs to passive three-dimensional reconstruction and has the advantages of low cost, simple structure and good practicality. Since every stereo matching method has its own limitations, both traditional and learning-based methods place high demands on matching completeness and network lightweighting. Optimization of the matched depth map is therefore essential to the effect of subsequent point cloud fusion. To obtain effective depth information for better dense reconstruction, most existing multi-view stereo matching reconstruction methods operate per pixel. Such reconstruction methods, however, face two key problems: long running time prevents effective lightweighting, and it is difficult to achieve high completeness while ensuring accuracy.
Chinese patent publication No. CN113963117A, entitled "Multi-view three-dimensional reconstruction method and device based on a deformable convolution depth network", describes a method that first inputs a source image and reference images from multiple viewing angles; then extracts features of the input images through a multi-scale feature network built with deformable convolutions; then performs iterative optimization of pixel depth matching and edge processing with a learning-based patch-match iterative model to obtain an iteratively optimized depth map; and finally feeds the iteratively optimized depth map and the source image into a depth residual network for optimization, yielding the final depth map, from which three-dimensional reconstruction produces a stereoscopic map. This method adopts supervised learning and suffers from heavy memory occupation and low completeness.
Disclosure of Invention
The invention provides a multi-view stereo matching reconstruction method based on perceptual consistency loss to solve the problems of heavy memory occupation and low completeness in existing multi-view stereo matching methods. The obtained depth map is smoother and more complete, yields a better reconstruction effect in subsequent point cloud fusion, and better supports detailed observation of objects.
The technical scheme for solving the technical problem is as follows:
the multi-view stereo matching reconstruction method based on the perception consistency loss comprises the following steps:
step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a dual correction layer module, each of which consists of convolution layers, regularization functions and activation functions; the pyramid cost volume regularization module performs downsampling extraction and upsampling integration regularization on the extracted features to construct a 3D cost volume and obtain a dense depth map; finally, the dual correction layer module applies simple filtering to the obtained depth map to remove redundant information, optimizing the depth map combination while retaining useful information;
step 2, preparing a data set: using a source image and reference images as data input, and estimating a dense depth map D of the three-dimensional structure from the reference view using the original 3D points;
step 3, training a network model: inputting the data set prepared in the step 2 into the network model constructed in the step 1 for training;
step 4, designing a loss function and performing regression: comparing, by parallax, the depth information obtained from the data augmentation consistency loss with the depth information obtained from the depth perception loss to obtain valid surface depth values, which are passed as supervision to the depth regression loss; when the number of training iterations reaches a set threshold or the value of the loss function falls within a set range, the model parameters are considered trained and are saved;
and step 5, saving the model and testing: freezing the finally determined model parameters; when point cloud fusion and reconstruction are required, the images are input directly into the network to obtain the final three-dimensional point cloud;
step 6, point cloud fusion and reconstruction: fusing the three-dimensional point cloud obtained in step 5 and checking the fused matching reconstruction effect with MeshLab software; meanwhile, to further verify the quality of the model, suitable evaluation indexes are selected to measure the accuracy of the algorithm and the performance of the system.
The feature extraction module in step 1 comprises eight convolution layers. Convolution layers one to three perform a downsampling operation that halves the size of the feature map; convolution layers four to six downsample the feature map again, halving its size once more; finally the feature map is output by convolution layers seven and eight. The output of every convolution layer of the feature extraction module undergoes a regularization operation. The pyramid cost volume regularization module comprises ten convolution layers: three-dimensional convolution layers one and two perform the first downsampling of the extracted features, three-dimensional convolution layers three and four perform the second downsampling, and the convolved feature information is output through three-dimensional convolution layers five, six and seven. Three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function; three-dimensional deconvolution layer two upsamples the output of three-dimensional convolution layer two and outputs the deconvolved feature information through an activation function; three-dimensional deconvolution layer three upsamples the output of three-dimensional convolution layer four and outputs the deconvolved feature information through an activation function. Finally the feature information is integrated to construct a 3D cost volume and obtain a dense depth map. The output of every convolution layer of the pyramid cost volume regularization module is regularized. The dual correction layer module comprises four convolution layers: convolution layers one to three apply simple filtering to the obtained depth map to remove redundant information, and convolution layer four upsamples the input and performs channel conversion to optimize the depth map combination while retaining useful information. The output of every convolution layer in the dual correction layer module undergoes a regularization operation. The sizes of the two-dimensional convolution kernels in all convolution layers are uniformly n × n; the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
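For illustration only, a minimal PyTorch-style sketch of how these three modules could be chained is given below; the soft-argmin depth regression, the injected cost-volume builder, and all class and function names are assumptions made for exposition, not details fixed by the patent.

```python
import torch
import torch.nn as nn

def soft_argmin(prob_volume, depth_values):
    # prob_volume: (B, D, H, W), softmax over D depth hypotheses;
    # depth_values: (D,) candidate depths. Returns (B, H, W) expected depth.
    return torch.sum(prob_volume * depth_values.view(1, -1, 1, 1), dim=1)

class MVSPipeline(nn.Module):
    """Chains the modules: features -> cost volume -> regularization -> correction."""

    def __init__(self, feature_net, cost_regularizer, correction_layers, build_cost_volume):
        super().__init__()
        self.feature_net = feature_net              # eight 2D conv layers (fig. 3)
        self.cost_regularizer = cost_regularizer    # pyramid 3D conv/deconv stack (fig. 4)
        self.correction_layers = correction_layers  # dual correction layers (fig. 5)
        self.build_cost_volume = build_cost_volume  # e.g. homography warping + variance

    def forward(self, ref_img, src_imgs, depth_values):
        ref_feat = self.feature_net(ref_img)
        src_feats = [self.feature_net(s) for s in src_imgs]
        cost = self.build_cost_volume(ref_feat, src_feats)        # (B, C, D, H, W)
        prob = torch.softmax(self.cost_regularizer(cost).squeeze(1), dim=1)
        depth_init = soft_argmin(prob, depth_values)              # dense initial depth
        return self.correction_layers(depth_init.unsqueeze(1))    # refined depth map
```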
In step 3, the DTU dataset is used in the training process; semi-supervised training is performed by estimating the pose of the reconstructed object from multi-frame images through calculation of the intrinsic and extrinsic camera parameters in the data set. Multi-frame pictures, calibrated camera parameters and depth information serve as the input of the whole network, and the surface information matched by the loss calculation serves as the label, addressing the problem that most work in the field of three-dimensional reconstruction can only be trained with full supervision.
In step 4, the loss function used in the training process is a combination of the data augmentation consistency loss and the depth perception loss; the obtained depth map maintains both smoothness and consistency while bringing out realistic surface details, improving the matching reconstruction effect of the fused point cloud.
After the model is saved in step 5, the ETH3D and Tanks & Temples datasets can be used for testing to evaluate the generalization ability of the model.
In step 6, accuracy and completeness are selected as the evaluation indexes, which effectively evaluate the efficiency of the algorithm and measure the performance of the matching network.
The invention also provides a multi-view stereo matching reconstruction system based on perceptual consistency loss, which comprises:
an image acquisition unit, used for acquiring a source image and reference images from multiple viewing angles, wherein the reference images can be processed to remove part of the surface information for semi-supervised training;

an image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multi-frame images through calculation of the intrinsic and extrinsic camera parameters in the data set, and performing semi-supervised training; the initially calculated depth map is input into the correction layers designed in the method to filter redundant depth information and obtain a smoother depth map;

a depth map optimization unit, used for taking the depth map output by training, calculating the matched surface information through the loss function designed in the method, and using it as a label to participate in training, thereby obtaining a dense depth map;

and a point cloud fusion and reconstruction unit, used for fusing the dense depth map into a three-dimensional point cloud and performing three-dimensional modeling on the fused point cloud to obtain a stereoscopic map.
The invention has the following beneficial effects:
1. Using the loss-calculated depth map as supervision to achieve semi-supervision reduces memory occupation and computation, and regresses the back-projected three-dimensional points of corresponding pixels toward their actual positions in the three-dimensional world, enhancing the robustness of the network. The resulting smooth depth map ensures that point cloud matching reconstruction achieves higher completeness and a better effect.
2. The dual correction layers are used in the backbone network for edge verification; depth information is exploited to the maximum extent, which effectively alleviates errors in edge occlusion information.
3. The whole training network uses a concatenation operation on the two branches to mix the effective depth information of the correction layer and the loss layer, giving the network a stronger ability to compute losses for images at the two different depths; depth regression is added to the loss function, so the network has few parameters, a simple overall structure, and high reconstruction accuracy.
Drawings
Fig. 1 is a flowchart of the multi-view stereo matching reconstruction method based on perceptual consistency loss.
Fig. 2 is a network structure diagram of a multi-view stereo matching reconstruction method based on perceptual consistency loss.
Fig. 3 shows the specific composition of the feature extraction module of the present invention.
Fig. 4 shows the specific composition of the pyramid cost volume regularization module of the present invention.
Fig. 5 shows the specific composition of the dual correction layer module of the present invention.
Fig. 6 is a schematic structural diagram of the multi-view stereo matching reconstruction system based on perceptual consistency loss according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the multi-view stereo matching reconstruction method based on perceptual consistency loss specifically includes the following steps:
step 1, constructing a network model. As shown in fig. 2, the entire network includes a feature extraction module, a pyramid cost volume regularization module, and a double correction layer module, and each module is composed of a convolution layer, regularization, and activation functions. As shown in fig. 3, in the feature extraction module, downsampling is performed on the convolutional layer one, convolutional layer two, and convolutional layer three, the feature map is reduced by half, the convolutional layer four, convolutional layer five, and convolutional layer six, downsampling is performed on the feature map, the feature map is reduced by half, and finally, the feature map is output by the convolutional layer seven and convolutional layer eight. The output of each convolution layer of the characteristic extraction module is subjected to regularization operation; as shown in fig. 4, in the pyramid volume regularization module, the first downsampling is performed on the extracted features by the three-dimensional convolution layer one and the three-dimensional convolution layer two pairs, the second downsampling is performed on the extracted features by the three-dimensional convolution layer three and the three-dimensional convolution layer four pairs, and feature information after convolution is output through the three-dimensional convolution layer five, the three-dimensional convolution layer six and the three-dimensional convolution layer seven. The three-dimensional deconvolution layer performs up-sampling on one pair of inputs, outputs deconvoluted feature information through an activation function, performs up-sampling on the outputs of two pairs of three-dimensional convolution layers of the three-dimensional deconvolution layer, outputs deconvoluted feature information through the activation function, performs up-sampling on the outputs of three pairs of three-dimensional convolution layers of the three-dimensional deconvolution layer, outputs deconvoluted feature information through the activation function, and finally integrates the feature information to construct a 3D cost volume to obtain a dense depth map. The output of each convolution layer of the pyramid volume regularization module is regularized; as shown in fig. 5, in the double correction layer module, the depth map obtained by the convolutional layer one, convolutional layer two, and convolutional layer three is simply filtered to remove redundant information, and the convolutional layer four is input to perform upsampling and channel conversion to optimize depth map combination and retain useful information. The output of each convolution layer in the double-correction-layer module is subjected to regularization operation; the sizes of two-dimensional convolution kernels in all the convolution layers are unified to be nxn; the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
Step 2, preparing a data set. Taking the source image and reference images as input, the original 3D points are used to estimate a dense depth map D of the three-dimensional structure from the reference view through semi-supervised training on the data set.
Step 3, training the network model. The data set prepared in step 2 is input into the network model constructed in step 1 for training. The DTU dataset is used in the training process; semi-supervised training is performed by estimating the pose of the reconstructed object from multi-frame images through calculation of the intrinsic and extrinsic camera parameters in the data set. Multi-frame pictures, calibrated camera parameters and depth information serve as the input of the whole network, and the surface information matched by the loss calculation serves as the label, addressing the problem that most work in the field of three-dimensional reconstruction can only be trained with full supervision.
Step 4, designing a loss function and performing regression. For the loss designed in fig. 2, the depth information obtained from the data augmentation consistency loss and the depth information obtained from the depth perception loss are compared by parallax to obtain valid surface depth values, which are passed as supervision to the depth regression loss; when the number of training iterations reaches the set threshold or the value of the loss function falls within the set range, the model parameters are considered trained and are saved. The obtained depth map maintains both smoothness and consistency while bringing out the realistic details of the object surface, improving the matching reconstruction effect of the fused point cloud.
Step 5, saving the model and testing. The finally determined model parameters are frozen; when point cloud fusion and reconstruction are required, the images are input directly into the network to obtain the final three-dimensional point cloud. The model is tested on the ETH3D and Tanks & Temples datasets to evaluate its generalization ability.
Step 6, point cloud fusion and reconstruction. The three-dimensional point cloud obtained in step 5 is fused, and the fused matching reconstruction effect is checked with MeshLab software; meanwhile, selecting accuracy and completeness as evaluation indexes effectively evaluates the efficiency of the algorithm and measures the performance of the matching network.
An embodiment is described as follows:
as shown in fig. 1, the multi-view stereo matching reconstruction method based on perceptual consistency loss specifically includes the following steps:
step 1, constructing a network model. As shown in fig. 2, the entire network includes a feature extraction module, a pyramid volume regularization module, and a double correction layer module, and each module is composed of a convolution layer, regularization, and an activation function. As shown in fig. 3, the feature extraction module includes eight two-dimensional convolution layers, the convolution kernel size of convolution layer three and convolution layer six is 5 × 5, and the step length and the padding are both 2; convolution kernels of the convolution layer I, the convolution layer II, the convolution layer IV, the convolution layer V, the convolution layer VII and the convolution layer VIII are 3 x 3, and step length and filling are all 1; as shown in fig. 4, the pyramid volume regularization module contains ten three-dimensional convolutional layers, the convolution kernels of all layers are 3 × 3 × 3, and the step sizes and the padding of the three-dimensional convolutional layer one, three-dimensional convolutional layer three, three-dimensional convolutional layer five, and three-dimensional convolutional layer seven are 2 and 1. The step length and the filling of the three-dimensional deconvolution layer I, the three-dimensional deconvolution layer II and the three-dimensional deconvolution layer III are all 1; as shown in fig. 5, the first correction layer and the second correction layer in the dual correction layer module have the same structure and include four two-dimensional convolution layers, convolution kernels of the first convolution layer, the second convolution layer, the third convolution layer and the fourth convolution layer are all 3 × 3, step length and padding are all 1, and the splicing dimension is 1. The linear rectification function regularization function is defined as follows:
f(x) = max(0, x)

J̃(ω; X, y) = J(ω; X, y) + αΩ(ω)

where X and y are the training samples and their corresponding labels, ω is the weight coefficient vector, J(·) is the objective function, and Ω(ω) is the penalty term; the parameter α controls the regularization strength, thereby controlling the complexity of the model and reducing overfitting.
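A minimal PyTorch sketch of the feature extraction module under the hyper-parameters listed above follows; the channel widths and the use of BatchNorm as the per-layer regularization operation are assumptions, not details fixed by the patent.

```python
import torch.nn as nn

def conv_bn_relu(cin, cout, kernel, stride, pad):
    # "Convolution + regularization + activation": BatchNorm stands in for the
    # regularization operation applied to every convolution layer's output.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel, stride, pad, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class FeatureExtraction(nn.Module):
    """Eight 2D conv layers; layers three and six use 5x5, stride 2, padding 2
    (each halving the feature map), all others 3x3, stride 1, padding 1."""

    def __init__(self, base_channels=8):
        super().__init__()
        c = base_channels
        self.net = nn.Sequential(
            conv_bn_relu(3, c, 3, 1, 1),          # conv 1
            conv_bn_relu(c, c, 3, 1, 1),          # conv 2
            conv_bn_relu(c, 2 * c, 5, 2, 2),      # conv 3: half resolution
            conv_bn_relu(2 * c, 2 * c, 3, 1, 1),  # conv 4
            conv_bn_relu(2 * c, 2 * c, 3, 1, 1),  # conv 5
            conv_bn_relu(2 * c, 4 * c, 5, 2, 2),  # conv 6: half resolution again
            conv_bn_relu(4 * c, 4 * c, 3, 1, 1),  # conv 7
            conv_bn_relu(4 * c, 4 * c, 3, 1, 1),  # conv 8: output feature map
        )

    def forward(self, x):
        return self.net(x)
```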
Step 2, preparing a data set. The DTU dataset is used. It consists of 124 different objects or scenes; each object is photographed from 49 views, and each view has 7 different brightness levels, so each object or scene folder contains 343 pictures. The dataset also provides a training image set with ground-truth depth maps. The resolution of each image is 1600 × 1200.
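The quoted statistics can be verified with a line of arithmetic:

```python
# Quick arithmetic check of the DTU statistics quoted above.
scans = 124           # objects / scenes
views = 49            # camera viewpoints per scan
lightings = 7         # brightness conditions per view

images_per_scan = views * lightings    # -> 343 pictures per scan folder
total_images = scans * images_per_scan
print(images_per_scan, total_images)   # 343, 42532 (each at 1600 x 1200)
```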
Step 3, training the network model. The network performs semi-supervised training on the pose of the reconstructed object estimated from multi-frame images. The semi-supervised training uses multi-frame pictures, calibrated camera parameters and a small amount of depth information as the input of the whole network, and uses the surface information matched by the loss calculation as the label; this addresses the problem that most work in the field of three-dimensional reconstruction can only be trained with full supervision, and gives the system small memory occupation and high completeness.
Step 4, designing a loss function and performing regression. For the loss designed in fig. 2, the depth information obtained from the data augmentation consistency loss and the depth information obtained from the depth perception loss are compared by parallax to obtain valid surface depth values, which are passed as supervision to the depth regression loss to optimize the depth map and achieve a better matching reconstruction effect. The data augmentation consistency loss is defined as follows:
L_DA = ( Σ_p M_τθ(p) · |D(p) − D̂_τθ(p)| ) / ( Σ_p M_τθ(p) )

where M_τθ denotes the unoccluded mask under the augmentation τθ, D denotes the depth predicted by the regular forward pass on the corrected image, and D̂_τθ denotes the depth predicted on the augmented image; data augmentation consistency is ensured by minimizing the difference between D and D̂_τθ within the mask.
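A minimal sketch of this masked consistency term, assuming depth_orig (D), depth_aug (D̂_τθ) and mask (M_τθ) are PyTorch tensors of the same shape; the exact norm and normalization used in the patent may differ:

```python
import torch

def data_augmentation_consistency_loss(depth_orig, depth_aug, mask):
    """L_DA: masked mean absolute difference between the depth predicted on the
    original (corrected) image and the depth predicted on its augmented
    counterpart, restricted to the unoccluded mask M_tau_theta."""
    diff = torch.abs(depth_orig - depth_aug) * mask
    return diff.sum() / mask.sum().clamp(min=1.0)
```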
The depth perception loss is defined as follows:
p′ = K′(R′R⁻¹(D(p)K⁻¹p − t) + t′)

P = D(p)K⁻¹p − t

P′ = D(p′)K′⁻¹p′ − t′

L_DP = Σ_{p∈V} ‖P − P′‖,  where V = { p : C(p) > ε_h and ‖P − P′‖ < ε_ω }

Here p′ is the estimated pixel corresponding to p; P and P′ are the back-projected three-dimensional depth perception points of p and p′; K, R, t and K′, R′, t′ are the intrinsic and extrinsic camera parameters of the two views; D(p) and C(p) denote the predicted depth and probability value of pixel p; ε_h is the threshold for highly reliable depth predictions, and ε_ω is the threshold for filtering out mismatched pixels. The depth perception predictions filtered by these two thresholds approximately detect edges, occlusions and erroneous non-occluded regions, which would otherwise form wrong correspondences.
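The back-projection and reprojection formulas translate directly into code; the per-pixel PyTorch sketch below follows them literally, with all function names and tensor conventions being illustrative assumptions:

```python
import torch

def back_project(p, depth_p, K, t):
    # P = D(p) K^-1 p - t, with p = [u, v, 1] a homogeneous pixel as a (3,) tensor.
    return depth_p * torch.linalg.inv(K) @ p - t

def reproject(P, K2, R, R2, t2):
    # p' = K'(R' R^-1 P + t'): project the 3D point into the second view.
    q = K2 @ (R2 @ torch.linalg.inv(R) @ P + t2)
    return q[:2] / q[2]

def depth_perception_valid(P, P2, conf, eps_h, eps_w):
    # A pixel contributes to L_DP only if its prediction is highly reliable
    # (C(p) > eps_h) and the two back-projected 3D points agree (||P - P'|| < eps_w).
    return (conf > eps_h) & (torch.linalg.norm(P - P2) < eps_w)
```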
The depth regression loss is defined as follows:
L_REG = λ1·L_SSIM + λ2·L_Smooth

where L_SSIM and L_Smooth are the structural similarity loss and the depth smoothness loss, both commonly used in depth estimation. L_SSIM can be expressed as:

L_SSIM = 1 − ( (2μ_x μ_y + c1)(2σ_xy + c2) ) / ( (μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2) )

where x and y denote the pixels of N × N windows in the two images; μ_x and μ_y are the means of x and y, serving as brightness estimates; σ_x² and σ_y² are the variances of x and y, serving as contrast estimates; and σ_xy is the covariance of x and y, serving as a structural similarity measure. c1 and c2 are small constants that avoid a zero denominator, typically 0.01 and 0.03 respectively. L_Smooth can be expressed as:

L_Smooth = Σ_p ( |∂_x D(p)| + |∂_y D(p)| )

where ∂_x D and ∂_y D are the first-order depth gradients along the horizontal and vertical directions, assumed continuous, and x and y index the pixels of the image.
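A compact PyTorch sketch of the two regression terms follows; the 11 × 11 window, the average-pooling implementation, and the 1 − SSIM loss form are common choices assumed here rather than specified by the patent:

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=11, c1=0.01, c2=0.03):
    """L_SSIM computed over local windows via average pooling; x, y: (B, 1, H, W)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    var_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim).mean()

def smooth_loss(depth):
    """L_Smooth: mean of first-order depth gradients along x and y."""
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    return dx.mean() + dy.mean()
```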
From the above four loss functions, an overall weighted loss function can be constructed as follows:

L = L_REG + λ3·L_DA + λ4·L_DP = λ1·L_SSIM + λ2·L_Smooth + λ3·L_DA + λ4·L_DP

Through extensive comparison and repeated testing, the weights can be assigned as follows: λ1 = 0.2, λ2 = 0.0067, λ3 = 0.1, λ4 = 0.8.
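In code, the overall loss is a single weighted sum (a sketch using the weights reported above):

```python
# Weighted combination of the four loss terms defined in this section.
LAMBDA = dict(ssim=0.2, smooth=0.0067, da=0.1, dp=0.8)

def total_loss(l_ssim, l_smooth, l_da, l_dp):
    return (LAMBDA["ssim"] * l_ssim + LAMBDA["smooth"] * l_smooth
            + LAMBDA["da"] * l_da + LAMBDA["dp"] * l_dp)
```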
The number of training epochs is set to 16. The upper limit on the number of pictures input to the network at a time is determined mainly by the performance of the computer's graphics processor; keeping it in the range of 1 to 4 generally makes network training more stable, gives better training results, and ensures rapid fitting of the network. During training the learning rate of the parameters is 0.001, which ensures fast fitting of the network without causing overfitting. The parameter optimizer uses the adaptive moment estimation (Adam) algorithm, whose advantage is that after bias correction the learning rate of each iteration lies within a definite range, making the parameters relatively stable. The threshold of the loss function is set to about 0.0003; below 0.0003, training of the entire network can be considered substantially complete.
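A hypothetical training loop mirroring these hyper-parameters might look as follows; model, loader and compute_loss are assumed to exist and are not defined by the patent:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive moment estimation

for epoch in range(16):                    # 16 training epochs
    for batch in loader:                   # batch size chosen in [1, 4] per GPU memory
        optimizer.zero_grad()
        loss = compute_loss(model, batch)  # the weighted total loss defined above
        loss.backward()
        optimizer.step()
    if loss.item() < 3e-4:                 # loss threshold: training essentially done
        break
```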
Step 5, saving the model and testing. Training is performed with the DTU dataset and the trained model is saved. The model is then tested on the ETH3D and Tanks & Temples datasets to evaluate its generalization ability.
Step 6, point cloud fusion and reconstruction. Subsequent texture enhancement and mesh mapping belong to traditional algorithms; the point cloud obtained after depth map fusion can be used directly for matching, and the reconstruction effect can be checked directly with MeshLab software to verify the quality of the matching network. Meanwhile, as shown in Table 1, selecting accuracy and completeness as evaluation indexes effectively evaluates the efficiency of the algorithm and measures the performance of the matching network. The formula for accuracy is as follows:
Acc = (1 / |R|) · Σ_{r∈R} [ min_{g∈G} ‖r − g‖ < d ]

where G is the ground-truth model, g is a point in the ground-truth model, R is the reconstructed model, and r is a point in the reconstructed model; [·] is the Iverson bracket, which equals 1 if the condition inside it is satisfied and 0 otherwise; d is a distance threshold.
The formula for completeness is as follows:

Comp = (1 / |G|) · Σ_{g∈G} [ min_{r∈R} ‖g − r‖ < d ]

where the symbols are as above: [·] is the Iverson bracket, equal to 1 if the condition inside it is satisfied and 0 otherwise, and d is a distance threshold.
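Both metrics reduce to a nearest-neighbour query followed by a thresholded average; the sketch below uses SciPy's KD-tree, an implementation choice assumed here:

```python
from scipy.spatial import cKDTree

def accuracy_and_completeness(recon_pts, gt_pts, d):
    """recon_pts (R) and gt_pts (G): (N, 3) point arrays; d: distance threshold.
    The Iverson bracket becomes a boolean comparison averaged over points."""
    dist_r_to_g, _ = cKDTree(gt_pts).query(recon_pts, k=1)
    dist_g_to_r, _ = cKDTree(recon_pts).query(gt_pts, k=1)
    accuracy = float((dist_r_to_g < d).mean())      # reconstruction near ground truth
    completeness = float((dist_g_to_r < d).mean())  # ground truth covered by reconstruction
    return accuracy, completeness
```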
As shown in fig. 6, the multi-view stereo matching reconstruction system based on perceptual consistency loss provided by the present invention includes:
an image acquisition unit, used for acquiring a source image and reference images from multiple viewing angles, wherein the reference images can be processed to remove part of the surface information for semi-supervised training;

an image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multi-frame images through calculation of the intrinsic and extrinsic camera parameters in the data set, and performing semi-supervised training; the initially calculated depth map is input into the correction layers designed in the method to filter redundant depth information and obtain a smoother depth map;

a depth map optimization unit, used for taking the depth map output by training, calculating the matched surface information through the loss function designed in the method, and using it as a label to participate in training, thereby obtaining a dense depth map;

and a point cloud fusion and reconstruction unit, used for fusing the dense depth map into a three-dimensional point cloud and performing three-dimensional modeling on the fused point cloud to obtain a stereoscopic map.
The implementations of convolution, activation functions, concatenation, regularization and Gaussian filtering are algorithms well known to those skilled in the art; the specific procedures and methods can be found in the corresponding textbooks or technical literature.
By constructing a multi-view stereo matching network based on perceptual consistency loss, the invention obtains the depth map directly from the source and reference images and fuses it into a point cloud without other intermediate steps, avoiding the hand-designed stereo matching algorithms of traditional methods. The feasibility and superiority of the method are further verified by computing the relevant indexes against depth maps obtained with the prior art. The relevant indexes of the prior-art methods and the method proposed by the present invention are shown in Table 1:
TABLE 1 comparison of relevant indexes of the prior art and the method proposed by the present invention
[Table 1 appears as an image in the original publication.]
It is worth noting that the learning method of this method and system is an end-to-end learning method in the semi-supervised domain, and the designed loss function is a key implementing feature of the invention.

Claims (7)

1. A multi-view stereo matching reconstruction method based on perceptual consistency loss, characterized by comprising the following steps:
step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a dual correction layer module, each of which consists of convolution layers, regularization functions and activation functions; the pyramid cost volume regularization module performs downsampling extraction and upsampling integration regularization on the extracted features to construct a 3D cost volume and obtain a dense depth map; finally, the dual correction layer module applies simple filtering to the obtained depth map to remove redundant information, optimizing the depth map combination while retaining useful information;
step 2, preparing a data set: using a source image and reference images as data input, and estimating a dense depth map D of the three-dimensional structure from the reference view using the original 3D points;
step 3, training a network model: inputting the data set prepared in the step 2 into the network model constructed in the step 1 for training;
step 4, designing a loss function and performing regression: comparing, by parallax, the depth information obtained from the data augmentation consistency loss with the depth information obtained from the depth perception loss to obtain valid surface depth values, which are passed as supervision to the depth regression loss; when the number of training iterations reaches a set threshold or the value of the loss function falls within a set range, the model parameters are considered trained and are saved;
and step 5, saving the model and testing: freezing the finally determined model parameters; when point cloud fusion and reconstruction are required, the images are input directly into the network to obtain the final three-dimensional point cloud;
step 6, point cloud fusion and reconstruction: fusing the three-dimensional point cloud obtained in step 5, and checking the fused matching reconstruction effect with MeshLab software; meanwhile, in order to further verify the quality of the model, suitable evaluation indexes are selected to measure the accuracy of the algorithm and the performance of the system.
2. The multi-view stereo matching reconstruction method based on perceptual consistency loss according to claim 1, wherein the feature extraction module in step 1 includes eight convolution layers: convolution layers one to three perform a downsampling operation that halves the size of the feature map; convolution layers four to six downsample the feature map, halving its size again; finally the feature map is output by convolution layers seven and eight; the output of every convolution layer of the feature extraction module undergoes a regularization operation; the pyramid cost volume regularization module comprises ten convolution layers: three-dimensional convolution layers one and two perform the first downsampling of the extracted features, three-dimensional convolution layers three and four perform the second downsampling, and the convolved feature information is output through three-dimensional convolution layers five, six and seven; three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function, three-dimensional deconvolution layer two upsamples the output of three-dimensional convolution layer two and outputs the deconvolved feature information through an activation function, and three-dimensional deconvolution layer three upsamples the output of three-dimensional convolution layer four and outputs the deconvolved feature information through an activation function; finally the feature information is integrated to construct a 3D cost volume and obtain a dense depth map; the output of every convolution layer of the pyramid cost volume regularization module is regularized; the dual correction layer module comprises four convolution layers: convolution layers one to three apply simple filtering to the obtained depth map to remove redundant information, and convolution layer four upsamples the input and performs channel conversion to optimize the depth map combination while retaining useful information; the output of every convolution layer in the dual correction layer module undergoes a regularization operation; the sizes of the two-dimensional convolution kernels in all convolution layers are uniformly n × n; the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
3. The multi-view stereo matching reconstruction method based on perceptual consistency loss according to claim 1, wherein the DTU dataset is used in the training process in step 3; semi-supervised training is performed by estimating the pose of the reconstructed object from multi-frame images through calculation of the intrinsic and extrinsic camera parameters in the data set; multi-frame pictures, calibrated camera parameters and depth information are taken as the input of the whole network, and the surface information matched by the loss calculation is taken as the label, addressing the problem that most work in the field of three-dimensional reconstruction can only be trained with full supervision.
4. The multi-view stereo matching reconstruction method based on perceptual consistency loss according to claim 1, wherein the loss function in step 4 uses a combination of the data augmentation consistency loss and the depth perception loss in the training process; the obtained depth map maintains both smoothness and consistency while bringing out realistic surface details, improving the matching reconstruction effect of the fused point cloud.
5. The multi-view stereo matching reconstruction method based on perceptual consistency loss according to claim 1, wherein in step 5, after the model is saved, the ETH3D and Tanks & Temples datasets are used for testing to evaluate its generalization ability.
6. The multi-view stereo matching reconstruction method based on perceptual consistency loss according to claim 1, wherein selecting accuracy and completeness as the evaluation indexes in step 6 effectively evaluates the efficiency of the algorithm and measures the performance of the matching network.
7. A multi-view stereo matching reconstruction system based on perceptual consistency loss, characterized by comprising:
an image acquisition unit, used for acquiring a source image and reference images from multiple viewing angles, wherein the reference images can be processed to remove part of the surface information for semi-supervised training;

an image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multi-frame images through calculation of the intrinsic and extrinsic camera parameters in the data set, and performing semi-supervised training; the initially calculated depth map is input into the correction layers designed in the method to filter redundant depth information and obtain a smoother depth map;

a depth map optimization unit, used for taking the depth map output by training, calculating the matched surface information through the loss function designed in the method, and using it as a label to participate in training, thereby obtaining a dense depth map;

and a point cloud fusion and reconstruction unit, used for fusing the dense depth map into a three-dimensional point cloud and performing three-dimensional modeling on the fused point cloud to obtain a stereoscopic map.
CN202211390106.8A 2022-11-08 2022-11-08 Multi-view stereo matching reconstruction method and system based on perception consistency loss Pending CN115546442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211390106.8A CN115546442A (en) 2022-11-08 2022-11-08 Multi-view stereo matching reconstruction method and system based on perception consistency loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211390106.8A CN115546442A (en) 2022-11-08 2022-11-08 Multi-view stereo matching reconstruction method and system based on perception consistency loss

Publications (1)

Publication Number Publication Date
CN115546442A true CN115546442A (en) 2022-12-30

Family

ID=84720673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211390106.8A Pending CN115546442A (en) 2022-11-08 2022-11-08 Multi-view stereo matching reconstruction method and system based on perception consistency loss

Country Status (1)

Country Link
CN (1) CN115546442A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333758A (en) * 2023-12-01 2024-01-02 博创联动科技股份有限公司 Land route identification system based on big data analysis
CN117333758B (en) * 2023-12-01 2024-02-13 博创联动科技股份有限公司 Land route identification system based on big data analysis
CN117437363A (en) * 2023-12-20 2024-01-23 安徽大学 Large-scale multi-view stereoscopic method based on depth perception iterator
CN117437363B (en) * 2023-12-20 2024-03-22 安徽大学 Large-scale multi-view stereoscopic method based on depth perception iterator
CN117818712B (en) * 2024-01-17 2024-05-31 广州中铁信息工程有限公司 Visual shunting intelligent management system based on railway station 5G ad hoc network

Similar Documents

Publication Publication Date Title
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
CN111462206B (en) Monocular structure light depth imaging method based on convolutional neural network
CN113066168B (en) Multi-view stereo network three-dimensional reconstruction method and system
CN115546442A (en) Multi-view stereo matching reconstruction method and system based on perception consistency loss
CN112435309A (en) Method for enhancing quality and resolution of CT image based on deep learning
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
WO2020019245A1 (en) Three-dimensional reconstruction method and apparatus for transparent object, computer device, and storage medium
CN115147709B (en) Underwater target three-dimensional reconstruction method based on deep learning
CN111582437B (en) Construction method of parallax regression depth neural network
CN114170311A (en) Binocular stereo matching method
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
CN115222889A (en) 3D reconstruction method and device based on multi-view image and related equipment
CN110889868B (en) Monocular image depth estimation method combining gradient and texture features
CN116310095A (en) Multi-view three-dimensional reconstruction method based on deep learning
CN115587987A (en) Storage battery defect detection method and device, storage medium and electronic equipment
CN115511708A (en) Depth map super-resolution method and system based on uncertainty perception feature transmission
CN109816781B (en) Multi-view solid geometry method based on image detail and structure enhancement
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
CN114120012A (en) Stereo matching method based on multi-feature fusion and tree structure cost aggregation
CN113780389A (en) Deep learning semi-supervised dense matching method and system based on consistency constraint
CN113705796A (en) Light field depth acquisition convolutional neural network based on EPI feature enhancement
CN117392496A (en) Target detection method and system based on infrared and visible light image fusion
CN117274349A (en) Transparent object reconstruction method and system based on RGB-D camera consistency depth prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination