CN115170921A - Binocular stereo matching method based on bilateral grid learning and edge loss - Google Patents
Binocular stereo matching method based on bilateral grid learning and edge loss
- Publication number: CN115170921A (application CN202210794705.XA)
- Authority
- CN
- China
- Prior art keywords
- loss
- parallax
- edge
- cost
- pseudo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4007—Interpolation-based scaling, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4038—Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/32—Indexing scheme for image data processing or generation, in general involving image mosaicing
Abstract
The invention discloses a binocular stereo matching method based on bilateral grid learning and edge loss, which comprises the following steps: step 1, feature extraction; step 2, constructing a guide map; step 3, constructing a cost space; step 4, cost aggregation; step 5, cost upsampling; step 6, parallax regression; step 7, refining the full-resolution parallax; step 8, calculating loss; and step 9, updating the parameters. The method achieves high disparity-estimation accuracy in edge regions, reduces endpoint error, and supports real-time inference.
Description
Technical Field
The invention relates to the field of three-dimensional reconstruction and stereoscopic vision, in particular to a binocular stereo matching method based on bilateral grid learning and edge loss.
Background
With the development of science and technology, two-dimensional image information increasingly fails to meet application requirements in areas such as autonomous driving, robot autonomous navigation, face recognition, and reverse engineering. During two-dimensional image acquisition, an important scene cue, depth, is lost, so a machine cannot fully understand a real scene. To acquire depth information, technologies such as stereoscopic vision, time of flight (ToF), and structured light have become research hotspots.
Compared with other technologies, stereoscopic vision has the advantages of low cost, high efficiency, and strong adaptability, and remains one of the key technologies in the field of three-dimensional reconstruction. In stereo vision, reconstruction that uses only two cameras (or two images) is usually referred to as binocular stereo vision.
Binocular stereo matching is a key step in binocular stereo vision, and the matching result largely determines the quality of the three-dimensional reconstruction. Compared with traditional methods built on hand-crafted features, deep learning offers researchers a new solution: performing stereo matching with the strong representation-learning ability of deep networks. In recent years, deep learning based algorithms have been able to extract more effective and robust features, and have gradually surpassed traditional methods in accuracy.
Among the disclosed technologies, publication No. CN 112150521 A, "binocular vision stereo matching method based on dilated convolution and cascaded cost volume", and publication No. CN 111833386 A, "pyramid binocular stereo matching method based on multi-scale information and attention mechanism", often adopt dilated-convolution and pyramid-like structures when constructing the neural network. Both structures tend to make the network produce gridding effects and blurred edges. Due to these defects, such networks achieve a relatively low endpoint error (EPE) in flat areas but a high endpoint error in edge areas. BGNet (Xu B, Xu Y, Yang X, et al. Bilateral grid learning for stereo matching networks [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 12497-12506), based on bilateral grid upsampling, is the first work to introduce bilateral grid learning into stereo matching. Through its bilateral-grid cost upsampling module, the model performs cost aggregation at extremely low resolution and then upsamples to high resolution, which greatly accelerates inference and gives the model a certain edge-preserving capability. However, its single-channel image input discards too much information, reducing the quality of the extracted features; and its final predicted disparity comes from bilinear upsampling, which introduces checkerboard artifacts. As a result, the network's final endpoint error on the Scene Flow test set is 1.17, a relatively low accuracy.
Disclosure of Invention
The invention aims to provide a binocular stereo matching method based on bilateral grid learning and edge loss that addresses the defects of the prior art. The method achieves high disparity-estimation accuracy in edge regions, reduces errors, and supports real-time inference.
The technical scheme for realizing the purpose of the invention is as follows:
A binocular stereo matching method based on bilateral grid learning and edge loss comprises the following steps:
Step 1, feature extraction: the network takes as input two RGB images I_l and I_r of shape H × W × 3, where I_l and I_r are the left and right views respectively. A residual-structure feature extraction module with shared weights extracts features from the two views. In the shallow part of the module, four convolution layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolution layers that alternately raise and lower the channel dimension, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352. The features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, constructing the guide map: to preserve detail information, before constructing the guide map, an image I_1/2, the original left view down-sampled to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution with a normalization layer then reduces the number of channels of f_g' to 1, yielding the final guide map G;
Step 3, constructing the cost space: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation rather than by direct concatenation or a mixture of the two. The group-wise correlation is computed as:

C(d, x, y, i) = (g / N_c) · ⟨ f_l^i(x, y), f_r^i(x − d, y) ⟩, d = 0, 1, …, D − 1,

where D is the disparity size of the cost space, g is the number of groups into which the N_c feature channels are divided, f^i denotes the i-th group of channels, and (x, y) are element coordinates. The number of groups is set to 44, and the disparity D used when constructing the group-wise correlation is 1/8 D_max, where D_max is a preset maximum disparity;
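The group-wise correlation above can be sketched directly in numpy. This is a minimal illustration, not the patent's implementation: the function name, the (C, H, W) feature layout, and the zero-fill for out-of-range disparities are assumptions for the example.

```python
import numpy as np

def groupwise_correlation_volume(f_l, f_r, max_disp, num_groups):
    """Build a group-wise correlation cost volume from left/right feature maps.

    f_l, f_r: arrays of shape (C, H, W); C must be divisible by num_groups.
    Returns a volume of shape (num_groups, max_disp, H, W), where entry
    (i, d, y, x) is the mean inner product of the i-th channel group of the
    left feature at (x, y) with the right feature at (x - d, y).
    """
    C, H, W = f_l.shape
    assert C % num_groups == 0
    ch = C // num_groups  # channels per group
    gl = f_l.reshape(num_groups, ch, H, W)
    gr = f_r.reshape(num_groups, ch, H, W)
    volume = np.zeros((num_groups, max_disp, H, W), dtype=f_l.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (gl * gr).mean(axis=1)
        else:
            # left pixel (x, y) is correlated with right pixel (x - d, y);
            # columns x < d have no valid counterpart and stay zero
            volume[:, d, :, d:] = (gl[..., d:] * gr[..., :-d]).mean(axis=1)
    return volume
```

With 352 channels and 44 groups, each group holds 8 channels, and the mean over them is the (g / N_c)-normalized inner product of the formula.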
Step 4, cost aggregation: stereo matching networks often use multiple stacked hourglass structures for cost aggregation; to improve efficiency, a single hourglass structure is used here. Specifically, the hourglass structure is a U-Net-like network built from 3D convolutions, with the concatenation after each skip connection replaced by summation to reduce computation. After cost aggregation, a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost upsampling: bilateral grid learning is a learnable bilateral grid; it inherits the speed and strong edge-preserving capability of the bilateral grid, and because its guide map and cost space are learnable modules, it generalizes well. Specifically, the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W is finally obtained through bilateral grid upsampling (implemented with PyTorch's grid_sample function);
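The slicing step of bilateral grid upsampling can be illustrated with a small numpy sketch. It is deliberately simplified relative to the method: a single cost channel instead of the 16 × 25 grid, and hand-rolled trilinear interpolation instead of PyTorch's grid_sample; the function name and grid layout are assumptions.

```python
import numpy as np

def slice_bilateral_grid(grid, guide):
    """Slice a bilateral grid with a full-resolution guide map.

    grid:  (Gz, Gy, Gx) low-resolution grid of cost values; the third
           axis Gz is indexed by guide intensity in [0, 1].
    guide: (H, W) guide map with values in [0, 1].
    Returns an (H, W) map, trilinearly interpolated from the grid, so the
    upsampling follows guide-map (intensity) edges rather than pixel grid
    edges; requires Gz, Gy, Gx >= 2.
    """
    Gz, Gy, Gx = grid.shape
    H, W = guide.shape
    # continuous sample coordinates into the grid
    ys = np.linspace(0, Gy - 1, H)
    xs = np.linspace(0, Gx - 1, W)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    zz = guide * (Gz - 1)
    y0 = np.clip(np.floor(yy).astype(int), 0, Gy - 2); wy = yy - y0
    x0 = np.clip(np.floor(xx).astype(int), 0, Gx - 2); wx = xx - x0
    z0 = np.clip(np.floor(zz).astype(int), 0, Gz - 2); wz = zz - z0
    out = np.zeros((H, W))
    for dz in (0, 1):          # accumulate the 8 trilinear corners
        for dy in (0, 1):
            for dx in (0, 1):
                w = ((wz if dz else 1 - wz) * (wy if dy else 1 - wy)
                     * (wx if dx else 1 - wx))
                out += w * grid[z0 + dz, y0 + dy, x0 + dx]
    return out
```

In the actual module this slicing is applied per disparity channel of C_l, which is why aggregation can run at 1/8 resolution while the sliced cost space comes out at 1/2 resolution.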
Step 6, parallax regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:

d_l(x, y) = Σ_k k · softmax(−C_H(k, x, y)),

where k is the disparity level, ranging up to 1/2 D_max;
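The soft-argmin regression is small enough to write out in full. A numpy sketch under the assumption that the cost volume is laid out as (D, H, W):

```python
import numpy as np

def soft_argmin(cost):
    """Soft-argmin disparity regression: softmax over the negated costs,
    then the expectation over disparity levels.

    cost: array of shape (D, H, W) of matching costs (lower = better).
    Returns the predicted disparity map of shape (H, W).
    """
    # numerically stable softmax over the negated cost
    neg = -cost
    neg = neg - neg.max(axis=0, keepdims=True)
    prob = np.exp(neg)
    prob /= prob.sum(axis=0, keepdims=True)
    levels = np.arange(cost.shape[0], dtype=float).reshape(-1, 1, 1)
    return (levels * prob).sum(axis=0)
```

Because the output is an expectation rather than a hard argmin, it is differentiable and yields sub-pixel disparities.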
Step 7, refining the full-resolution parallax: to avoid checkerboard artifacts, before disparity refinement d_l is first bilinearly interpolated to obtain the full-resolution disparity d_h; full-resolution refinement is then performed on d_h. Specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass structure and progressively refined to obtain the final predicted disparity d_final. In the training phase, after full-resolution disparity prediction, d_h is used as intermediate supervision and, together with the final predicted disparity d_final, serves as the predicted values of the model;
Step 8, calculating loss: in the training phase, the loss is calculated after disparity prediction. The loss is built from the Smooth L1 loss, which is robust to outliers and abnormal values and has a relatively small gradient change:

SmoothL1(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise.
the overall loss function consists of three parts:
The loss function Loss_gt is the Smooth L1 loss between the predicted value and the true value:

Loss_gt = (1 / |P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) − d_gt(p) ),

where P_gt is the set of points with a true label and d_gt is the true disparity at point p.
The true disparity is provided by the Scene Flow, KITTI 2012, and KITTI 2015 data sets commonly used by deep learning stereo matching methods. Scene Flow is a large-scale synthetic data set that provides dense true disparity; KITTI 2012 and KITTI 2015 are real street-view data sets that provide sparse true disparity.
When the true disparity provided by the data set is sparse, the total loss adds the loss function Loss_pseudo, the Smooth L1 loss between the predicted value and a pseudo disparity obtained from the prediction of a mature model:

Loss_pseudo = (1 / |P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) − d_pseudo(p) ),

where P_pseudo is the set of points that have no true label and are predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
Finally, the edge loss Loss_edges makes the designed network focus on edge regions during disparity estimation: an additional Smooth L1 loss is computed over the edge region. The edge region is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2.
The edge loss Loss_edges is calculated as:

Loss_edges = (1 / |P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) − d_E(p) ),

where P_E is the set of points contained in the edge region of edge map E_2 and d_E is the true disparity at point p of the edge region;
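The dilation that turns the thin Canny edge map E_1 into the edge region E_2 can be sketched in numpy without OpenCV (the Canny step itself is omitted; the function name is an assumption):

```python
import numpy as np

def dilate_edges(edge, size=5):
    """Dilate a binary edge map with a size x size rectangular structuring
    element, widening detected edges into an edge region (the point set
    P_E over which the edge loss is computed).

    edge: (H, W) array of 0/1 values, e.g. a Canny output.
    """
    r = size // 2
    H, W = edge.shape
    padded = np.pad(edge, r)
    out = np.zeros_like(edge)
    # a pixel is in the region if any pixel in its size x size
    # neighborhood is an edge pixel
    for dy in range(size):
        for dx in range(size):
            out = np.maximum(out, padded[dy:dy + H, dx:dx + W])
    return out
```

A single edge pixel thus grows into a 5 × 5 patch, so the loss also covers the pixels immediately beside depth discontinuities, where disparity errors concentrate.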
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:

Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges,

where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are calculated separately as loss_1 and loss_2, and the value of 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
Step 9, updating parameters: parameters are updated with the Adam optimizer, and the learning rate is decayed with a multi-step schedule;
The data set is divided into a training set, a validation set, and a test set; the training and validation sets are used during training, and the test set is used to evaluate the quality of the final model's predicted disparity;
Steps 1 to 9 are repeated until the model converges, at which point training ends; the model that performs best on the validation set is taken as the final stereo matching model for the stereo matching task.
Unlike other methods, this technical scheme fully considers both inference speed and disparity prediction in edge regions, and makes the following innovations. Convolution operations in a convolutional neural network blur edges; to reduce the influence of this blurring and preserve rich detail, the guide map for bilateral grid learning is constructed by fusing shallow features extracted in the feature-extraction stage with the original image. To keep the network real-time while improving accuracy, only group-wise correlation is used when constructing the cost space, bilateral grid learning is adopted to perform 3D cost aggregation at low resolution, and a loss function that adds no extra inference time is used. Because the disparity of edge regions is among the hardest to predict in the whole view, the prediction errors of deep learning based stereo matching networks are mostly concentrated in edge regions; the edge loss Loss_edges is therefore introduced to focus the loss function on these hard-to-predict edge regions.
The specific effect of this technical scheme is as follows: by optimizing the disparity prediction of edge regions with bilateral grid learning, which has strong edge-preserving capability, and with the edge loss, the endpoint error of edge regions can be significantly reduced while the endpoint error of flat regions is also reduced, and real-time inference remains possible.
The method achieves high disparity-estimation accuracy in edge regions, reduces errors, and supports real-time inference.
Drawings
Fig. 1 is a schematic diagram of a stereo matching network architecture according to an embodiment;
FIG. 2 is a flowchart of a stereo matching method according to an embodiment;
fig. 3 is a schematic diagram of a model training process according to an embodiment.
Detailed Description
The invention will be described in further detail with reference to the following drawings and specific examples, but the invention is not limited thereto.
Example:
referring to fig. 2, a binocular stereo matching method based on bilateral mesh learning and edge loss includes the following steps:
Step 1, feature extraction: the network takes as input two RGB images I_l and I_r of shape H × W × 3, where I_l and I_r are the left and right views respectively. Before the data are input into the model, the left and right views are processed as follows: stereo rectification, random cropping, random color adjustment (contrast, gamma, brightness, hue, saturation), random vertical flipping, creation of edge maps, normalization to [0, 1.0], and regularization with the ImageNet mean and variance; random cropping, random color adjustment, and random vertical flipping are each applied with probability 1/2. The preprocessed left and right views are input into the stereo matching network, as shown in Fig. 1. A residual-structure feature extraction module with shared weights extracts features from the two views. In the shallow part of the module, four convolution layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolution layers that alternately raise and lower the channel dimension, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352. The features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, constructing the guide map: to preserve detail information, before constructing the guide map, an image I_1/2, the original left view down-sampled to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution with a normalization layer then reduces the number of channels of f_g' to 1, yielding the final guide map G;
Step 3, constructing the cost space: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation, computed as:

C(d, x, y, i) = (g / N_c) · ⟨ f_l^i(x, y), f_r^i(x − d, y) ⟩, d = 0, 1, …, D − 1,

where D is the disparity size of the cost space, g is the number of groups into which the N_c feature channels are divided, f^i denotes the i-th group of channels, and (x, y) are element coordinates. The number of groups is set to 44, and the disparity D used when constructing the group-wise correlation is 1/8 D_max, where D_max is a preset maximum disparity;
Step 4, cost aggregation: stereo matching networks often use multiple stacked hourglass structures for cost aggregation; to improve efficiency, a single hourglass structure is used here. Specifically, the hourglass structure is a U-Net-like network built from 3D convolutions, with the concatenation after each skip connection replaced by summation to reduce computation. After cost aggregation, a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost upsampling: bilateral grid learning is a learnable bilateral grid; it inherits the speed and strong edge-preserving capability of the bilateral grid, and because its guide map and cost space are learnable modules, it generalizes well. Specifically, the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W is finally obtained through bilateral grid upsampling (implemented with PyTorch's grid_sample function);
Step 6, parallax regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:

d_l(x, y) = Σ_k k · softmax(−C_H(k, x, y)),

where k is the disparity level, ranging up to 1/2 D_max;
Step 7, refining the full-resolution parallax: to avoid checkerboard artifacts, before disparity refinement d_l is first bilinearly interpolated to obtain the full-resolution disparity d_h; full-resolution refinement is then performed on d_h. Specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass structure and progressively refined to obtain the final predicted disparity d_final. In the training phase, after full-resolution disparity prediction, d_h is used as intermediate supervision and, together with the final predicted disparity d_final, serves as the predicted values of the model;
Step 8, calculating loss: in the training phase, the loss is calculated after disparity prediction. The loss is built from the Smooth L1 loss, which is robust to outliers and abnormal values and has a relatively small gradient change:

SmoothL1(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise.
the overall loss function consists of three parts:
The loss function Loss_gt is the Smooth L1 loss between the predicted value and the true value:

Loss_gt = (1 / |P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) − d_gt(p) ),

where P_gt is the set of points with a true label and d_gt is the true disparity at point p.
The true disparity is provided by the Scene Flow, KITTI 2012, and KITTI 2015 data sets commonly used by deep learning stereo matching methods. Scene Flow is a large-scale synthetic data set that provides dense true disparity; KITTI 2012 and KITTI 2015 are real street-view data sets that provide sparse true disparity.
When the true disparity provided by the data set is sparse, the total loss adds the loss function Loss_pseudo, the Smooth L1 loss between the predicted value and a pseudo disparity obtained from the prediction of a mature model:

Loss_pseudo = (1 / |P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) − d_pseudo(p) ),

where P_pseudo is the set of points that have no true label and are predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
Finally, the edge loss Loss_edges makes the designed network focus on edge regions during disparity estimation: an additional Smooth L1 loss is computed over the edge region. The edge region is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2.
The edge loss Loss_edges is calculated as:

Loss_edges = (1 / |P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) − d_E(p) ),

where P_E is the set of points contained in the edge region of edge map E_2 and d_E is the true disparity at point p of the edge region;
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:

Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges,

where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are calculated separately as loss_1 and loss_2, and the value of 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
Step 9, updating parameters: the model training process is shown in Fig. 3. In this example the Adam optimizer is used (β_1 = 0.9, β_2 = 0.999); the learning rate is updated with a multi-step schedule (MultiStepLR) with milestones (20, 30, 40, 50, 60, 70) and a decay factor of 0.1 at each milestone;
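The multi-step schedule is simple enough to state in closed form: the learning rate is the base rate multiplied by the decay factor once for each milestone epoch already passed. A pure-Python sketch (the function name is an assumption; PyTorch's torch.optim.lr_scheduler.MultiStepLR implements the same rule statefully):

```python
def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at a given epoch under MultiStepLR decay:
    base_lr * gamma ** (number of milestones already reached)."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed
```

With the milestones above and gamma = 0.1, the rate drops by a factor of ten at epochs 20, 30, 40, 50, 60 and 70.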
The data set is divided into a training set, a validation set, and a test set; the training and validation sets are used during training, and the test set is used to evaluate the quality of the final model's predicted disparity;
Steps 1 to 9 are repeated until the model converges, at which point training ends; the model that performs best on the validation set is taken as the final stereo matching model for the stereo matching task.
Claims (1)
1. A binocular stereo matching method based on bilateral grid learning and edge loss, characterized by comprising the following steps:
Step 1, feature extraction: the network takes as input two RGB images I_l and I_r of shape H × W × 3, where I_l and I_r are the left and right views respectively; a residual-structure feature extraction module with shared weights extracts features from the two views; in the shallow part of the module, four convolution layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolution layers that alternately raise and lower the channel dimension, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, constructing the guide map: to preserve detail information, before constructing the guide map, an image I_1/2, the original left view down-sampled to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution with a normalization layer then reduces the number of channels of f_g' to 1, yielding the final guide map G;
Step 3, constructing the cost space: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation, computed as:

C(d, x, y, i) = (g / N_c) · ⟨ f_l^i(x, y), f_r^i(x − d, y) ⟩, d = 0, 1, …, D − 1,

where D is the disparity size of the cost space, g is the number of groups into which the N_c feature channels are divided, f^i denotes the i-th group of channels, and (x, y) are element coordinates. The number of groups is set to 44, and the disparity D used when constructing the group-wise correlation is 1/8 D_max, where D_max is a preset maximum disparity;
Step 4, cost aggregation: stereo matching networks often use multiple stacked hourglass structures for cost aggregation; to improve efficiency, a single hourglass structure is used here. Specifically, the hourglass structure is a U-Net-like network built from 3D convolutions, with the concatenation after each skip connection replaced by summation to reduce computation. After cost aggregation, a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost upsampling: the guide map G obtained in step 2 and C_L obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid upsampling finally yields a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
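Bilateral-grid upsampling slices a low-resolution cost grid up to the guide map's resolution: each high-resolution pixel reads the grid at its coarse spatial position and at a depth selected by the guide intensity. The NumPy sketch below shows only the slicing step; the number of guide-intensity bins and the trilinear weighting are illustrative assumptions, and the learned construction of the grid itself is omitted.

```python
import numpy as np

def slice_bilateral_grid(grid, guide):
    """Slice a cost bilateral grid to the guide map's resolution.

    grid:  (D, B, h, w) low-resolution cost grid with B guide-intensity bins.
    guide: (H, W) guide map with values in [0, 1] at the target resolution.
    Returns a (D, H, W) upsampled cost volume via trilinear interpolation
    (bilinear in space, linear in the guide-intensity dimension).
    """
    D, B, h, w = grid.shape
    H, W = guide.shape
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1); wy = ys - y0
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1); wx = xs - x0
    g = guide * (B - 1)
    g0 = np.floor(g).astype(int); g1 = np.minimum(g0 + 1, B - 1); wg = g - g0
    out = np.zeros((D, H, W), dtype=grid.dtype)
    # accumulate the 8 trilinear corners (2 per axis: y, x, guide bin)
    for gy, wy_ in ((y0, 1 - wy), (y1, wy)):
        for gx, wx_ in ((x0, 1 - wx), (x1, wx)):
            spatial = grid[:, :, gy[:, None], gx[None, :]]      # (D, B, H, W)
            for gb, wg_ in ((g0, 1 - wg), (g1, wg)):
                vals = np.take_along_axis(spatial, gb[None, None], axis=1)[:, 0]
                out += wy_[None, :, None] * wx_[None, None, :] * wg_[None] * vals
    return out
```

Because the guide map carries edge information from the half-resolution left view, the slice can upsample the cost sharply across depth discontinuities instead of blurring them as plain bilinear interpolation would.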
Step 6, disparity regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:

d_l(x, y) = Σ_{k=0}^{K−1} k · softmax(−C_H(k, x, y)),

where k is the disparity level and the number of levels K is set to 1/2 D_max;
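The soft argmin regression above can be written compactly in NumPy. This sketch assumes the usual sign convention (lower cost = better match):

```python
import numpy as np

def soft_argmin(cost):
    """Soft-argmin disparity regression.

    cost: (K, H, W) aggregated matching cost, lower = better match.
    Returns the expected disparity of shape (H, W):
        d = sum_k k * softmax(-cost)_k
    """
    c = -cost
    c = c - c.max(axis=0, keepdims=True)        # numerical stability
    prob = np.exp(c)
    prob /= prob.sum(axis=0, keepdims=True)     # softmax over disparity levels
    levels = np.arange(cost.shape[0], dtype=cost.dtype).reshape(-1, 1, 1)
    return (levels * prob).sum(axis=0)
```

Unlike a hard argmin, this expectation over disparity levels is differentiable and yields sub-pixel disparities, which is why it is the standard regression head in end-to-end stereo networks.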
Step 7, refining the full-resolution disparity: to avoid checkerboard artifacts, before disparity refinement d_l is first bilinearly interpolated to the full-resolution disparity d_h, and d_h is then refined at full resolution. Specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass structure and progressively refined to obtain the final predicted disparity d_final. In the training stage, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the predicted values of the model;
Step 8, calculating the loss: in the training stage, the loss is calculated after disparity prediction is completed. The losses are built from Smooth L1 Loss, which is robust to outliers and abnormal values and has small gradient variation:

SmoothL1(x) = 0.5·x², if |x| < 1; |x| − 0.5, otherwise.
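A minimal NumPy version of the element-wise Smooth L1 function above:

```python
import numpy as np

def smooth_l1(x):
    """Element-wise Smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)
```

The two branches join continuously at |x| = 1 (both give 0.5), so the loss is quadratic near zero (small, stable gradients) and linear for large residuals (bounded gradient, hence the robustness to outliers).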
The overall loss function consists of three parts.

The loss function Loss_gt computes Smooth L1 Loss between the predicted value and the true value:

Loss_gt = (1 / |P_gt|) · Σ_{p ∈ P_gt} SmoothL1(d(p) − d_gt(p)),

where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The Scene Flow, KITTI 2012 and KITTI 2015 datasets commonly used in deep-learning stereo matching provide true disparities. When the true disparity provided by the dataset is sparse, the loss function Loss_pseudo is added to the total loss. Loss_pseudo computes Smooth L1 Loss between the predicted value and a pseudo-disparity, where the pseudo-disparity is predicted by a mature model:

Loss_pseudo = (1 / |P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1(d(p) − d_pseudo(p)),

where P_pseudo is the set of points that are given no true label and are predicted with the mature model, and d_pseudo is the pseudo-disparity at point p;
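Loss_gt and Loss_pseudo are both mean Smooth L1 losses over disjoint point sets; a sketch in NumPy, where the boolean masks standing in for P_gt and P_pseudo are illustrative:

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def masked_loss(pred, target, mask):
    """Mean Smooth L1 over the points selected by a boolean mask."""
    return smooth_l1(pred[mask] - target[mask]).mean()

# Loss_gt over labelled points, Loss_pseudo over the unlabelled remainder:
#   loss_gt     = masked_loss(d_pred, d_gt, gt_mask)
#   loss_pseudo = masked_loss(d_pred, d_pseudo, ~gt_mask)
```

Filling the unlabelled pixels with pseudo-labels gives the network a dense supervision signal even on sparse-ground-truth datasets such as KITTI.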
The last term is the edge loss Loss_edge. To make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 Loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2.
The edge loss Loss_edge is calculated as:

Loss_edge = (1 / |P_E|) · Σ_{p ∈ P_E} SmoothL1(d(p) − d_E(p)),

where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p of the edge region;
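The edge-region construction (Canny, then a 5 × 5 dilation) and the edge loss can be sketched as follows. The Canny step itself is not reimplemented here (in practice e.g. OpenCV's Canny would produce E_1); only the dilation and the masked loss are shown, with pure-NumPy dilation standing in for a morphology routine:

```python
import numpy as np

def dilate(mask, ksize=5):
    """Binary dilation with a ksize x ksize rectangular structuring element."""
    r = ksize // 2
    padded = np.pad(mask, r, constant_values=False)
    H, W = mask.shape
    out = np.zeros_like(mask)
    for dy in range(ksize):
        for dx in range(ksize):
            out |= padded[dy:dy + H, dx:dx + W]
    return out

def edge_loss(pred, gt, edge_mask):
    """Mean Smooth L1 over the (dilated) edge region P_E."""
    diff = pred[edge_mask] - gt[edge_mask]
    ax = np.abs(diff)
    return np.where(ax < 1.0, 0.5 * diff * diff, ax - 0.5).mean()
```

Dilating E_1 widens each detected edge into a band, so the extra penalty covers the pixels on both sides of a depth discontinuity rather than only the one-pixel edge line.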
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edge:

Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edge,

where w_1, w_2 and w_3 are preset hyper-parameters, set to 1.0, 1.0 and 1.0 respectively;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is finally back-propagated;
Step 9, updating parameters: parameters are updated with the Adam optimizer, and multi-step decay is used for learning-rate decay;
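Multi-step decay keeps the base learning rate until each milestone epoch is passed, then multiplies it by a decay factor; a sketch, where the milestone epochs and the factor 0.5 are assumed values not specified in the text:

```python
def multistep_lr(base_lr, epoch, milestones=(20, 32), gamma=0.5):
    """Learning rate after multi-step decay: base_lr * gamma^(milestones passed)."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

In a PyTorch training loop the same schedule is typically obtained with `torch.optim.lr_scheduler.MultiStepLR` wrapped around the Adam optimizer.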
The dataset is divided into a training set, a verification set and a test set; the training set and verification set are used during training, and the test set is used to evaluate the quality of the final model's predicted disparity;
Steps 1 to 9 are repeated until the model converges and training is finished; the model that performs best on the verification set is taken as the final stereo matching model for the stereo matching task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210794705.XA CN115170921A (en) | 2022-07-07 | 2022-07-07 | Binocular stereo matching method based on bilateral grid learning and edge loss |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115170921A (en) | 2022-10-11
Family
ID=83490332
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115170921A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116128946A (en) * | 2022-12-09 | 2023-05-16 | 东南大学 | Binocular infrared depth estimation method based on edge guiding and attention mechanism |
CN116128946B (en) * | 2022-12-09 | 2024-02-09 | 东南大学 | Binocular infrared depth estimation method based on edge guiding and attention mechanism |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20221011