CN115546442A - Multi-view stereo matching reconstruction method and system based on perception consistency loss - Google Patents
Multi-view stereo matching reconstruction method and system based on perception consistency loss
- Publication number
- CN115546442A (application CN202211390106.8A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- layer
- loss
- depth
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T17/20 — Three-dimensional [3D] modelling; finite element generation, e.g. wire-frame surface description, tessellation
- G06N3/08 — Neural networks; learning methods
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; image merging
Abstract
A multi-view stereo matching reconstruction method and system based on perceptual consistency loss, belonging to the field of three-dimensional reconstruction and aiming to solve the problems of heavy memory usage and low completeness in the prior art. The method comprises the following steps. Step 1, construct the network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a double correction layer module. Step 2, prepare the data set: a source image and reference images are used as data input, and a dense depth map D of the three-dimensional structure is estimated from the reference view using the original 3D points. Step 3, train the network model: the data set prepared in step 2 is input into the network model constructed in step 1 for training. Step 4, design the loss function and perform regression. Step 5, save the model and test it. Step 6, point cloud fusion and reconstruction: the three-dimensional point clouds obtained in step 5 are fused, the fused matching reconstruction result is inspected with MeshLab software, and suitable evaluation indexes are selected to measure the accuracy of the algorithm and the performance of the system.
Description
Technical Field
The invention relates to a multi-view stereo matching reconstruction method and system based on perception consistency loss, and belongs to the technical field of three-dimensional reconstruction.
Background
Three-dimensional reconstruction is widely applied in industrial measurement, intelligent robotics, unmanned systems, medical diagnosis, digital city modeling, somatosensory entertainment and other areas. Multi-view stereo matching is an end-to-end learning approach within three-dimensional reconstruction; it belongs to passive three-dimensional reconstruction and is characterized by low cost, simple structure and good practicality. Since every stereo matching method has its own limitations, both conventional and learning-based methods place high demands on matching completeness and on keeping the network lightweight. Optimization of the matched depth map is therefore essential for the quality of subsequent point cloud fusion. To obtain effective depth information for better dense reconstruction, most existing multi-view stereo matching reconstruction methods operate at the pixel level. However, such methods face two key problems: long running time prevents a truly lightweight implementation, and it is difficult to achieve high completeness while maintaining accuracy.
Chinese patent publication No. CN113963117A, entitled "a method and apparatus for multi-view three-dimensional reconstruction based on variable convolution depth network", describes a method that inputs a source image and reference images from multiple viewing angles; extracts features from the input images through a multi-scale feature network built from deformable convolutions; performs iterative optimization of pixel depth matching and edge processing with a learning-based patch matching iterative model to obtain an iteratively optimized depth map; and finally feeds the iteratively optimized depth map and the source image into a depth residual network for refinement to obtain the final depth map, from which three-dimensional reconstruction yields a stereoscopic vision map. The method adopts supervised learning, occupies much memory and achieves low completeness.
Disclosure of Invention
The invention provides a multi-view stereo matching reconstruction method based on perceptual consistency loss, which aims to solve the problems of heavy memory usage and low completeness in existing multi-view stereo matching methods. The resulting depth map is smoother and more complete, improves subsequent point cloud fusion, and better matches detailed observation of objects.
The technical scheme for solving the technical problem is as follows:
the multi-view stereo matching reconstruction method based on the perception consistency loss comprises the following steps:
step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid cost volume regularization module and a double correction layer module, and each module is composed of convolution layers, a regularization function and an activation function; the pyramid cost volume regularization module applies downsampling extraction and upsampling integration regularization to the extracted features to construct a 3D cost volume and obtain a dense depth map; finally, the double correction layer module applies simple filtering to the obtained depth map to remove redundant information, optimizing the depth map combination and retaining the useful information;
step 2, preparing a data set: using a source image and a reference image as data input, and estimating a dense depth map D of a three-dimensional structure from a reference view by using original 3D points;
step 3, training a network model: inputting the data set prepared in the step 2 into the network model constructed in the step 1 for training;
step 4, designing the loss function and performing regression: depth information obtained from the data-augmentation consistency loss and depth information obtained from the depth perception loss are compared by disparity to obtain effective ground-truth depth values, which are passed to the depth regression loss as supervision; when the number of training iterations reaches a set threshold or the value of the loss function falls within a set range, the model parameters are considered trained and are saved;
step 5, saving the model and testing: the finally determined model parameters are frozen, so that when point cloud fusion and reconstruction are required, the image can be input directly into the network to obtain the final three-dimensional point cloud;
step 6, point cloud fusion and reconstruction: the three-dimensional point cloud obtained in step 5 is fused, and the fused matching reconstruction result is inspected with MeshLab software; to further verify the quality of the model, the most suitable evaluation indexes are selected to measure the accuracy of the algorithm and the performance of the system.
The feature extraction module in step 1 comprises eight convolution layers. Convolution layers one, two and three perform the first downsampling, halving the size of the feature map; convolution layers four, five and six perform the second downsampling, halving the feature map again; the feature map is finally output by convolution layers seven and eight. The output of every convolution layer in the feature extraction module is regularized. The pyramid cost volume regularization module comprises ten convolution layers: three-dimensional convolution layers one and two perform the first downsampling of the extracted features, three-dimensional convolution layers three and four perform the second downsampling, and three-dimensional convolution layers five, six and seven output the convolved feature information. Three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function; three-dimensional deconvolution layers two and three likewise upsample the outputs of their corresponding three-dimensional convolution layers and pass the results through activation functions; finally the feature information is integrated to construct the 3D cost volume and obtain a dense depth map.
The output of every convolution layer in the pyramid cost volume regularization module is regularized. The double correction layer module comprises four convolution layers: convolution layers one, two and three apply simple filtering to the obtained depth map to remove redundant information, and convolution layer four upsamples the input and converts its channels to optimize the depth map combination and retain the useful information. The output of every convolution layer in the double correction layer module is regularized. The sizes of the two-dimensional convolution kernels in all convolution layers are uniformly n × n, and the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
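As a rough check on the spatial sizes implied by the two halving stages described above (the input resolution is taken from the DTU images used later in the embodiment; the helper itself is only illustrative):

```python
def downsampled_size(height, width, halvings):
    """Halve the spatial size of a feature map `halvings` times,
    as done by the two downsampling stages of the feature extraction module."""
    for _ in range(halvings):
        height, width = height // 2, width // 2
    return height, width

# DTU images are 1600 x 1200; after the two halving stages the
# feature map is 1/4 of the original resolution in each dimension.
print(downsampled_size(1200, 1600, 2))  # (300, 400)
```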
In step 3, the DTU data set is used during training. Semi-supervised training is performed by estimating the pose of the reconstructed object from multiple frames through the intrinsic and extrinsic camera parameters in the data set. Multi-frame pictures, calibrated camera parameters and depth information serve as the input of the whole network, and ground-truth information matched by the loss calculation serves as the label, which addresses the problem that most work in the field of three-dimensional reconstruction supports only supervised training.
In step 4, the loss function during training is chosen as the combination of the data-augmentation consistency loss and the depth perception loss. The resulting depth image preserves smoothness and consistency while bringing out the true details of the object surface, thereby improving the matching reconstruction quality of the fused point cloud.
After the model is saved in step 5, the ETH3D and Tanks & Temples data sets can be used for testing to evaluate the generalization ability of the model.
In step 6, accuracy and completeness are selected as evaluation indexes, which effectively evaluate the efficiency of the algorithm and measure the behavior of the matching network.
The invention also provides a multi-view stereo matching reconstruction system based on perceptual consistency loss, which comprises:
an image acquisition unit for acquiring a source image and reference images from multiple viewing angles, where the reference images can be processed to remove part of the ground-truth information for semi-supervised training;
an image training unit for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multiple frames through the intrinsic and extrinsic camera parameters in the data set, performing semi-supervised training, and feeding the initially computed depth map into the correction layer designed in the method to filter out redundant depth information and obtain a smoother depth map;
a depth map optimization unit for taking the depth map output by training, computing the matched ground-truth information as a label through the loss function designed in the method to participate in training, and thereby obtaining a dense depth map;
and a point cloud fusion and reconstruction unit for fusing the dense depth map into a three-dimensional point cloud and performing three-dimensional modeling from the fused point cloud to obtain a stereoscopic vision map.
The invention has the following beneficial effects:
1. Using the depth image obtained from the loss calculation as supervision to achieve semi-supervision reduces memory usage and computation, and regresses the three-dimensional points back-projected from the corresponding pixels toward their actual positions in the three-dimensional world, enhancing the robustness of the network. The computed smooth depth map also makes the point cloud matching reconstruction more complete and effective.
2. The double correction layers in the backbone network perform edge checking, make maximal use of the depth information values, and effectively resolve errors in edge occlusion information.
3. The whole training network uses a concatenation operation on the two branches to mix the effective depth information of the correction layer and the loss layer, giving the network a stronger ability to compute losses for images at two different depths; with depth regression added to the loss function, the network has few parameters, a simple overall structure and high reconstruction accuracy.
Drawings
Fig. 1 is a flowchart of the multi-view stereo matching reconstruction method based on perceptual consistency loss.
Fig. 2 is a network structure diagram of a multi-view stereo matching reconstruction method based on perceptual consistency loss.
Fig. 3 shows the specific composition of the feature extraction module according to the present invention.
Fig. 4 shows the specific composition of the pyramid cost volume regularization module of the present invention.
Fig. 5 shows the specific composition of the double correction layer module according to the present invention.
Fig. 6 is a schematic structural diagram of the multi-view stereo matching reconstruction system based on perceptual consistency loss according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the multi-view stereo matching reconstruction method based on perceptual consistency loss specifically includes the following steps:
Step 1, constructing a network model. As shown in fig. 2, the entire network includes a feature extraction module, a pyramid cost volume regularization module and a double correction layer module, and each module is composed of convolution layers, regularization and activation functions. As shown in fig. 3, in the feature extraction module, convolution layers one, two and three perform the first downsampling, halving the feature map; convolution layers four, five and six perform the second downsampling, halving the feature map again; the feature map is finally output by convolution layers seven and eight. The output of every convolution layer in the feature extraction module is regularized. As shown in fig. 4, in the pyramid cost volume regularization module, three-dimensional convolution layers one and two perform the first downsampling of the extracted features, three-dimensional convolution layers three and four perform the second downsampling, and three-dimensional convolution layers five, six and seven output the convolved feature information.
Three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function; three-dimensional deconvolution layers two and three likewise upsample the outputs of their corresponding three-dimensional convolution layers and pass the results through activation functions; finally the feature information is integrated to construct the 3D cost volume and obtain a dense depth map. The output of every convolution layer in the pyramid cost volume regularization module is regularized. As shown in fig. 5, in the double correction layer module, convolution layers one, two and three apply simple filtering to the obtained depth map to remove redundant information, and convolution layer four upsamples the input and converts its channels to optimize the depth map combination and retain the useful information. The output of every convolution layer in the double correction layer module is regularized. The sizes of the two-dimensional convolution kernels in all convolution layers are uniformly n × n, and the sizes of the three-dimensional convolution kernels are uniformly n × n × n.
Step 2, preparing a data set. With the source image and reference images as data input, the original 3D points are used from the reference view to estimate a dense depth map D of the three-dimensional structure, and the data set is used for semi-supervised training.
Step 3, training the network model. The data set prepared in step 2 is input into the network model constructed in step 1 for training. The DTU data set is used during training. Semi-supervised training is performed by estimating the pose of the reconstructed object from multiple frames through the intrinsic and extrinsic camera parameters in the data set. Multi-frame pictures, calibrated camera parameters and depth information serve as the input of the whole network, and ground-truth information matched by the loss calculation serves as the label, which addresses the problem that most work in the field of three-dimensional reconstruction supports only supervised training.
Step 4, designing the loss function and performing regression. With the loss designed in fig. 2, depth information obtained from the data-augmentation consistency loss and depth information obtained from the depth perception loss are compared by disparity to obtain effective ground-truth depth values, which are passed as supervision to the depth regression loss; when the number of training iterations reaches a set threshold or the value of the loss function falls within a set range, the model parameters are considered trained and are saved. The resulting depth image preserves smoothness and consistency while bringing out the true details of the object surface, improving the matching reconstruction quality of the fused point cloud.
Step 5, saving the model and testing. The finally determined model parameters are frozen, and when point cloud fusion and reconstruction are required the image is input directly into the network to obtain the final three-dimensional point cloud. The model is tested on the ETH3D and Tanks & Temples data sets to evaluate its generalization ability.
Step 6, point cloud fusion and reconstruction. The three-dimensional point cloud obtained in step 5 is fused, and the fused matching reconstruction result is inspected with MeshLab software. Meanwhile, selecting accuracy and completeness as evaluation indexes effectively evaluates the efficiency of the algorithm and measures the behavior of the matching network.
The embodiment is as follows:
As shown in fig. 1, the multi-view stereo matching reconstruction method based on perceptual consistency loss specifically includes the following steps:
Step 1, constructing a network model. As shown in fig. 2, the entire network includes a feature extraction module, a pyramid cost volume regularization module and a double correction layer module, and each module is composed of convolution layers, regularization and activation functions. As shown in fig. 3, the feature extraction module includes eight two-dimensional convolution layers; the convolution kernel size of convolution layers three and six is 5 × 5, with stride and padding both 2; the convolution kernels of convolution layers one, two, four, five, seven and eight are 3 × 3, with stride and padding both 1. As shown in fig. 4, the pyramid cost volume regularization module contains ten three-dimensional convolution layers; the convolution kernels of all layers are 3 × 3 × 3, and the strides and paddings of three-dimensional convolution layers one, three, five and seven are 2 and 1 respectively. The strides and paddings of three-dimensional deconvolution layers one, two and three are all 1. As shown in fig. 5, correction layer one and correction layer two in the double correction layer module have the same structure and include four two-dimensional convolution layers; the convolution kernels of convolution layers one to four are all 3 × 3, stride and padding are all 1, and the concatenation dimension is 1. The activation function is the linear rectification unit (ReLU), and the regularized training objective is defined as follows:
J̃(ω; X, y) = J(ω; X, y) + αΩ(ω)
where X and y are the training samples and corresponding labels, and ω is the weight coefficient vector; J(·) is the objective function and Ω(ω) is the penalty term; the parameter α controls the regularization strength, thereby controlling the complexity of the model and reducing overfitting.
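As a minimal numeric sketch of this regularized objective (the squared-error objective J, the L2 penalty Ω(ω) = ‖ω‖², and the toy data are illustrative assumptions, not the patent's actual choices):

```python
def regularized_objective(weights, data, labels, alpha):
    """J~(w; X, y) = J(w; X, y) + alpha * Omega(w), with a mean-squared-error
    objective J for a linear predictor and an L2 penalty Omega."""
    # Base objective: mean squared error of a linear predictor.
    j = sum((sum(w * x for w, x in zip(weights, xs)) - y) ** 2
            for xs, y in zip(data, labels)) / len(labels)
    # Penalty term: squared L2 norm of the weight vector.
    omega = sum(w * w for w in weights)
    return j + alpha * omega

# The weights fit the toy data exactly, so only the penalty term remains.
print(regularized_objective([1.0, -2.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, -2.0], 0.1))  # 0.5
```

A larger α penalizes large weights more strongly, trading fit against model complexity.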
Step 2, preparing a data set. The DTU data set is used. It consists of 124 different objects or scenes; each object is captured from 49 views, and each view is taken under 7 different brightness levels, so each object or scene folder contains 343 pictures. The data set also provides the training image set with ground-truth depth maps. The resolution of each image is 1600 × 1200.
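The per-scene image count follows directly from the capture setup described above:

```python
views_per_scene = 49       # camera positions per object or scene
brightness_levels = 7      # lighting conditions per view
scenes = 124               # objects/scenes in the DTU data set

images_per_scene = views_per_scene * brightness_levels
print(images_per_scene)            # 343
print(scenes * images_per_scene)   # 42532 images across the whole data set
```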
Step 3, training the network model. The network performs semi-supervised training by estimating the pose of the reconstructed object from multiple frames. Semi-supervised training uses multi-frame pictures, calibrated camera parameters and a small amount of depth information as the input of the whole network, and ground-truth information matched by the loss calculation as the label, which addresses the problem that most work in the field of three-dimensional reconstruction supports only supervised training, while occupying little memory and achieving high completeness.
Step 4, designing the loss function and performing regression. With the loss designed in fig. 2, depth information obtained from the data-augmentation consistency loss and depth information obtained from the depth perception loss are compared by disparity to obtain effective ground-truth depth values, which are passed as supervision to the depth regression loss to optimize the depth map and achieve a better matching reconstruction. The data-augmentation consistency loss is defined as follows:
L_DA = (1 / ‖M_τθ‖) · ‖ M_τθ ⊙ (D − D̂_τθ) ‖
where M_τθ denotes the unoccluded mask under the augmentation τθ, D denotes the prediction of the regular forward pass on the original image, and D̂_τθ denotes the prediction on the augmented image; minimizing the masked difference between D and D̂_τθ ensures data-augmentation consistency.
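A minimal numeric sketch of this masked consistency term (the absolute-difference form and the toy arrays are illustrative assumptions):

```python
import numpy as np

def data_aug_consistency_loss(depth, depth_aug, mask):
    """Mean absolute difference between the depth prediction on the original
    image and on the augmented image, restricted to the unoccluded mask."""
    valid = mask.astype(bool)
    if not valid.any():
        return 0.0  # no unoccluded pixels: nothing to compare
    return float(np.abs(depth[valid] - depth_aug[valid]).mean())

depth = np.array([[1.0, 2.0], [3.0, 4.0]])      # prediction on the original image
depth_aug = np.array([[1.5, 2.0], [3.0, 4.0]])  # prediction on the augmented image
mask = np.array([[1, 1], [0, 0]])               # only the top row is unoccluded
print(data_aug_consistency_loss(depth, depth_aug, mask))  # 0.25
```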
The depth perception loss is defined as follows:
p′ = K′(R′R⁻¹(D(p)K⁻¹p − t) + t′)
P = D(p)K⁻¹p − t
P′ = D(p′)K′⁻¹p′ − t′
Here p′ denotes the estimated pixel corresponding to p, P and P′ are the back-projected three-dimensional depth perception points of p and p′, D(p) and C(p) denote the predicted depth and probability value of pixel p, ε_h denotes the threshold for highly reliable depth prediction, and ε_ω denotes the threshold for filtering out mismatched pixels. The depth perception predictions obtained with the two thresholds ε_h and ε_ω can approximately detect edges, occlusions and erroneous non-occluded regions, which would otherwise form incorrect correspondences.
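A small numeric sketch of the reprojection p′ = K′(R′R⁻¹(D(p)K⁻¹p − t) + t′) above (the homogeneous pixel convention, identity intrinsics/rotations and the chosen translation are assumptions picked so the geometry is easy to follow by hand):

```python
import numpy as np

def reproject(p, depth, K, R, t, K2, R2, t2):
    """p' = K'(R' R^-1 (D(p) K^-1 p - t) + t'), followed by
    dehomogenization by the third coordinate."""
    p_h = np.array([p[0], p[1], 1.0])                                  # homogeneous pixel
    X = R2 @ np.linalg.inv(R) @ (depth * np.linalg.inv(K) @ p_h - t) + t2
    q = K2 @ X
    return q[:2] / q[2]

# Identity intrinsics and rotations; the source camera is shifted along x.
I3 = np.eye(3)
p2 = reproject((2.0, 1.0), 4.0, I3, I3, np.zeros(3), I3, I3, np.array([1.0, 0.0, 0.0]))
print(p2)  # [2.25 1.  ]
```

With everything else at identity, the pixel simply shifts by t′ₓ/depth in x, which matches the printed result.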
The depth regression loss is defined as follows:
L_REG = λ1 · L_SSIM + λ2 · L_Smooth
where L_SSIM and L_Smooth are the structural similarity loss and the depth smoothing loss, both common auxiliary losses for depth estimation. L_SSIM is based on the structural similarity index
SSIM(x, y) = ((2 μ_x μ_y + c1)(2 σ_xy + c2)) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))
where x and y denote pixel windows of size N × N in the two images; μ_x and μ_y are the means of x and y, serving as luminance estimates; σ_x² and σ_y² are the variances of x and y, serving as contrast estimates; σ_xy is the covariance of x and y, serving as the structural similarity measure; and c1 and c2 are small constants that avoid a zero denominator, typically derived from 0.01 and 0.03 respectively. L_Smooth penalizes the gradients of the predicted depth map as an edge-aware smoothness term.
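A minimal per-window SSIM computation following the standard definition with the window statistics described above (the constants c1 = 0.01² and c2 = 0.03², for images scaled to [0, 1], are the usual convention and an assumption here):

```python
import numpy as np

def ssim_window(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural similarity of two same-size pixel windows with values in [0, 1]."""
    mu_x, mu_y = x.mean(), y.mean()            # luminance estimates
    var_x, var_y = x.var(), y.var()            # contrast estimates
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # structural term
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

w = np.array([[0.2, 0.4], [0.6, 0.8]])
print(ssim_window(w, w))  # identical windows give SSIM = 1.0
```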
From the above four loss terms, the overall weighted loss function can be constructed as follows:
L = L_REG + λ3 · L_DA + λ4 · L_DPR = λ1 · L_SSIM + λ2 · L_Smooth + λ3 · L_DA + λ4 · L_DPR
Through extensive comparison and repeated tests, the weights can be assigned as follows: λ1 = 0.2, λ2 = 0.0067, λ3 = 0.1, λ4 = 0.8.
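With the weights above, the combined loss is a plain weighted sum; a trivial sketch (the individual per-term loss values are placeholders):

```python
# Weights lambda1..lambda4 as given in the text.
WEIGHTS = {"ssim": 0.2, "smooth": 0.0067, "da": 0.1, "dpr": 0.8}

def total_loss(l_ssim, l_smooth, l_da, l_dpr, w=WEIGHTS):
    """L = lambda1*L_SSIM + lambda2*L_Smooth + lambda3*L_DA + lambda4*L_DPR."""
    return (w["ssim"] * l_ssim + w["smooth"] * l_smooth
            + w["da"] * l_da + w["dpr"] * l_dpr)

# Placeholder per-term values of 1.0, just to show the weighting:
print(total_loss(1.0, 1.0, 1.0, 1.0))  # ~1.1067
```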
The training times are set to be 16, the upper limit of the number of the pictures input to the network each time is mainly determined according to the performance of a computer graphic processor, and generally, the number of the pictures input to the network each time is within a range of 1-4, so that the network training can be more stable, the training result is better, and the rapid fitting of the network is ensured. In the training process, the learning rate of the parameters is 0.001, so that the fast fitting of the network can be ensured, and the overfitting of the network cannot be caused. The algorithm of the parameter optimizer selects an adaptive matrix estimation algorithm, and the method has the advantages that after offset correction, the learning rate of each iteration has a certain range, so that the parameters are relatively stable. The threshold of the penalty function is set to about 0.0003, and less than 0.0003 can be considered that the training of the entire network is substantially complete.
And 5, saving the model and testing. Training is performed on the DTU dataset and the trained model is saved. The model is then tested on the ETH3D and Tanks & Temples datasets to evaluate its generalization ability.
And 6, point cloud fusion and reconstruction. Subsequent texture enhancement and mesh mapping belong to traditional algorithms; the point cloud obtained after depth map fusion can be used directly for matching, and the reconstruction effect can be inspected directly with MeshLab software to verify the quality of the matching network. Meanwhile, as shown in table 1, selecting accuracy and completeness as the evaluation indices effectively evaluates the efficiency of the algorithm and measures the effect of the matching network. The formula for accuracy is as follows:

accuracy = (1/|R|) · Σ_{r∈R} [ min_{g∈G} ‖r − g‖ < d ]
wherein G is the ground-truth model, g is a point in the ground-truth model, R is the reconstructed model, and r is a point in the reconstructed model. [·] is the Iverson bracket, which equals 1 if the condition inside it is satisfied and 0 otherwise. d is a distance threshold.
The formula for completeness is as follows:

completeness = (1/|G|) · Σ_{g∈G} [ min_{r∈R} ‖g − r‖ < d ]
wherein the symbols have the same meanings as in the accuracy formula: G is the ground-truth model, g a point in it, R the reconstructed model, r a point in it, [·] the Iverson bracket, and d a distance threshold.
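The two indices can be sketched with NumPy as below; the brute-force nearest-neighbour search and the function names are illustrative assumptions (point clouds are given as N×3 arrays, and the comparison against d plays the role of the Iverson bracket).

```python
import numpy as np

def accuracy(R, G, d):
    """Fraction of reconstructed points r in R whose nearest
    ground-truth point g in G lies within distance d."""
    dists = np.linalg.norm(R[:, None, :] - G[None, :, :], axis=-1)
    return (dists.min(axis=1) < d).mean()

def completeness(R, G, d):
    """Fraction of ground-truth points g in G whose nearest
    reconstructed point r in R lies within distance d."""
    dists = np.linalg.norm(G[:, None, :] - R[None, :, :], axis=-1)
    return (dists.min(axis=1) < d).mean()
```

Accuracy penalizes spurious reconstructed points far from the ground truth, while completeness penalizes ground-truth regions the reconstruction fails to cover; reporting both prevents either failure mode from being hidden.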
As shown in fig. 6, the multi-view stereo matching reconstruction system based on perception consistency loss provided by the present invention includes:
the image acquisition unit, used for acquiring a source image and reference images from a plurality of viewing angles, wherein the reference images can be processed to remove part of the ground-truth surface information for semi-supervised training;
the image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multiple frames through the camera's intrinsic and extrinsic parameters in the dataset, and performing semi-supervised training; the initially computed depth map is then input into the correction layer designed in the present invention to filter out redundant depth information and obtain a smoother depth map;
the depth map optimization unit, which applies the loss function designed in the present invention to the depth map output by training, computing the matched ground-truth surface information as a label that participates in training, thereby obtaining a dense depth map;
and the point cloud fusion and reconstruction unit is used for fusing the dense depth map into three-dimensional point cloud and carrying out three-dimensional modeling according to the fused point cloud so as to obtain a stereoscopic vision map.
The implementations of convolution, activation functions, concatenation, regularization, and Gaussian filtering are algorithms well known to those skilled in the art; their specific procedures and methods can be found in the corresponding textbooks or technical literature.
According to the invention, by constructing a multi-view stereo matching network based on perception consistency loss, the depth map can be obtained directly from the source and reference images and fused into a point cloud without other intermediate steps, avoiding the hand-designed stereo matching algorithms of traditional methods. The feasibility and superiority of the method are further verified by computing the relevant indices of the depth maps obtained by the prior art. The relevant indices of the prior-art methods and of the method proposed by the present invention are shown in table 1:
TABLE 1. Comparison of relevant indices between the prior art and the method proposed by the present invention
It is worth noting that the learning method of this method and system is an end-to-end learning method in the semi-supervised field, and the designed loss function is an important enabling component of the invention.
Claims (7)
1. A multi-view stereo matching reconstruction method based on perception consistency loss is characterized by comprising the following steps:
step 1, constructing a network model: the whole network comprises a feature extraction module, a pyramid volume regularization module, and a double correction layer module, each of which consists of convolutional layers, regularization functions, and activation functions; the pyramid volume regularization module performs down-sampling extraction and up-sampling integration regularization on the extracted features to construct a 3D cost volume and obtain a dense depth map; finally, the double correction layer module lightly filters the obtained depth map to remove redundant information, optimizing the depth map combination while retaining useful information;
step 2, preparing a data set: using a source image and a reference image as data input, and estimating a dense depth map D of a three-dimensional structure from a reference view by using original 3D points;
step 3, training a network model: inputting the data set prepared in the step 2 into the network model constructed in the step 1 for training;
step 4, designing the loss function and performing regression: the depth information obtained from the data-enhancement consistency loss is compared, via disparity, with the depth information obtained from the depth perception loss to obtain effective surface depth values, which are passed to the depth regression loss as supervision, until the number of training iterations reaches the set threshold or the value of the loss function falls within the set range, at which point the model parameters are considered trained and are saved;
and 5, saving the model and testing: the finally determined model parameters are frozen, and when point cloud fusion and reconstruction are required, images are input directly into the network to obtain the final three-dimensional point cloud;
step 6, point cloud fusion and reconstruction: fusing the three-dimensional point cloud obtained in the step 5, and checking the fused matching reconstruction effect by using MeshLab software; meanwhile, in order to further verify the quality of the model, the optimal evaluation index is selected to measure the accuracy of the algorithm and the performance of the system.
2. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein the feature extraction module in step 1 comprises eight convolutional layers: convolutional layers one to three perform downsampling, halving the feature map size; convolutional layers four to six downsample the feature map, halving its size once more; and convolutional layers seven and eight output the final feature map; the output of every convolutional layer of the feature extraction module undergoes a regularization operation; the pyramid volume regularization module comprises ten convolutional layers: three-dimensional convolutional layers one and two perform the first downsampling of the extracted features, three-dimensional convolutional layers three and four perform the second downsampling, and three-dimensional convolutional layers five, six, and seven output the convolved feature information; three-dimensional deconvolution layer one upsamples its input and outputs the deconvolved feature information through an activation function, three-dimensional deconvolution layer two upsamples the output of three-dimensional deconvolution layer one and outputs the deconvolved feature information through an activation function, and three-dimensional deconvolution layer three upsamples the output of three-dimensional convolutional layer four and outputs the deconvolved feature information through an activation function; finally, the feature information is integrated to construct a 3D cost volume and obtain a dense depth map; the output of every convolutional layer of the pyramid volume regularization module is regularized; the double correction layer module comprises four convolutional layers: convolutional layers one to three lightly filter the obtained depth map to remove redundant information, and convolutional layer four performs upsampling and channel conversion on the input, optimizing the depth map combination while retaining useful information; the output of every convolutional layer in the double correction layer module undergoes a regularization operation; the sizes of the two-dimensional convolution kernels in all convolutional layers are uniformly n×n, and the sizes of the three-dimensional convolution kernels are uniformly n×n×n.
3. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein the dataset used in the training process of step 3 is DTU; the pose of the reconstructed object is estimated from multiple frames, via the camera's intrinsic and extrinsic parameters in the dataset, to carry out semi-supervised training; multi-frame images, calibrated camera parameters, and depth information serve as the input of the whole network, and the ground-truth surface information computed by the matching loss serves as the label, addressing the limitation that most three-dimensional reconstruction methods can only be trained with full supervision.
4. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein the loss function in step 4 selects a combination of the data-enhancement consistency loss and the depth perception loss during training; the resulting depth map thus maintains both smoothness and consistency while highlighting the fine details of the true surface, thereby improving the matching reconstruction effect of the fused point cloud.
5. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein in step 5, after the model is saved, the ETH3D and Tanks & Temples datasets are used for testing to evaluate its generalization ability.
6. The multi-view stereo matching reconstruction method based on perception consistency loss according to claim 1, wherein selecting accuracy and completeness as the evaluation indices in step 6 effectively evaluates the efficiency of the algorithm and measures the effect of the matching network.
7. A multi-view stereo matching reconstruction system based on perception consistency loss, characterized by comprising:
the image acquisition unit, used for acquiring a source image and reference images from a plurality of viewing angles, wherein the reference images can be processed to remove part of the ground-truth surface information for semi-supervised training;
the image training unit, used for inputting the source image and reference images into the network for training, estimating the pose of the reconstructed object from multiple frames through the camera's intrinsic and extrinsic parameters in the dataset, and performing semi-supervised training; the initially computed depth map is then input into the correction layer designed in the present invention to filter out redundant depth information and obtain a smoother depth map;
the depth map optimization unit, which applies the loss function designed in the present invention to the depth map output by training, computing the matched ground-truth surface information as a label that participates in training, thereby obtaining a dense depth map;
and the point cloud fusion and reconstruction unit is used for fusing the dense depth map into three-dimensional point cloud and carrying out three-dimensional modeling according to the fused point cloud so as to obtain a stereoscopic vision map.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211390106.8A | 2022-11-08 | 2022-11-08 | Multi-view stereo matching reconstruction method and system based on perception consistency loss |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115546442A | 2022-12-30 |
Family ID: 84720673
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117333758A | 2023-12-01 | 2024-01-02 | 博创联动科技股份有限公司 | Land route identification system based on big data analysis |
| CN117437363A | 2023-12-20 | 2024-01-23 | 安徽大学 | Large-scale multi-view stereoscopic method based on depth perception iterator |
| CN117818712B | 2024-01-17 | 2024-05-31 | 广州中铁信息工程有限公司 | Visual shunting intelligent management system based on railway station 5G ad hoc network |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |