CN115170921A - Binocular stereo matching method based on bilateral grid learning and edge loss - Google Patents

Binocular stereo matching method based on bilateral grid learning and edge loss Download PDF

Info

Publication number
CN115170921A
Authority
CN
China
Prior art keywords
loss
parallax
edge
cost
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210794705.XA
Other languages
Chinese (zh)
Inventor
陈明
闭韦杰
张正钦
容仕军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202210794705.XA
Publication of CN115170921A
Legal status: Withdrawn (current)

Classifications

    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 3/4007: Interpolation-based scaling, e.g. bilinear interpolation
    • G06T 3/4038: Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 2200/32: Indexing scheme for image data processing or generation, in general, involving image mosaicing

Abstract

The invention discloses a binocular stereo matching method based on bilateral grid learning and edge loss, which comprises the following steps: step 1, feature extraction; step 2, guide map construction; step 3, cost space construction; step 4, cost aggregation; step 5, cost up-sampling; step 6, disparity regression; step 7, full-resolution disparity refinement; step 8, loss calculation; and step 9, parameter update. The method achieves high accuracy in disparity estimation of edge regions, reduces errors, and supports real-time inference.

Description

Binocular stereo matching method based on bilateral grid learning and edge loss
Technical Field
The invention relates to the field of three-dimensional reconstruction and stereoscopic vision, in particular to a binocular stereo matching method based on bilateral grid learning and edge loss.
Background
With the development of science and technology, two-dimensional image information increasingly fails to meet application requirements in areas such as autonomous driving, robot navigation, face recognition and reverse engineering. During two-dimensional image acquisition, depth, an important scene cue, is lost, so a machine cannot fully understand the real scene. To acquire depth information, technologies such as stereoscopic vision, time of flight (ToF) and structured light have become research hotspots.
Compared with other technologies, stereoscopic vision offers low cost, high efficiency and strong adaptability, and remains one of the key technologies in three-dimensional reconstruction. Stereo vision that uses only two cameras (or two images) for reconstruction is commonly called binocular stereo vision.
Binocular stereo matching is a key step in binocular stereo vision, and the matching result largely determines the quality of the three-dimensional reconstruction. Compared with traditional methods built on hand-crafted features, deep learning offers researchers a new solution: performing stereo matching with the strong learning capability of deep networks. In recent years, deep-learning-based algorithms have obtained more effective and robust features and have gradually surpassed traditional methods in accuracy.
Among the disclosed technologies, publication No. CN 112150521A, a binocular vision stereo matching method based on dilated (atrous) convolution and cascaded cost volumes, and publication No. CN 111833386A, a pyramid binocular stereo matching method based on multi-scale information and an attention mechanism, among others, often adopt dilated convolutions and pyramid-like structures when constructing the neural network. Both choices tend to make the network produce gridding effects and blurred edges. Because of these defects, such networks usually achieve a relatively low endpoint error (EPE) in flat regions but a high endpoint error in edge regions. BGNet (Xu B, Xu Y, Yang X, et al. Bilateral grid learning for stereo matching networks[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 12497-12506.), based on bilateral grid up-sampling, was the first work to introduce bilateral grid learning into stereo matching; with its bilateral-grid cost up-sampling module, the model performs cost aggregation at extremely low resolution and then up-samples to high resolution, which greatly accelerates inference and gives a certain edge-preserving ability. However, its single-channel image input discards too much information, which lowers the quality of the extracted features, and its final predicted disparity comes from bilinear up-sampling, which introduces checkerboard artifacts. As a result, the network's final endpoint error on the Scene Flow test set is 1.17, a relatively low accuracy.
Disclosure of Invention
The invention aims to provide, in view of the deficiencies of the prior art, a binocular stereo matching method based on bilateral grid learning and edge loss. The method achieves high accuracy in disparity estimation of edge regions, reduces errors, and supports real-time inference.
The technical solution that achieves this aim is as follows:
A binocular stereo matching method based on bilateral grid learning and edge loss comprises the following steps:
Step 1, feature extraction: input RGB images I_l and I_r of shape H × W × 3 into the network, where I_l and I_r are the left view and the right view respectively; a residual structure is used as the feature extraction module, with the weights of the feature extraction module shared between the two views; in the shallow part of the module, four convolutional layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolutional layers that alternately raise and lower the dimensionality, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, guide map construction: to preserve detail information, before constructing the guide map an image I_1/2, obtained by down-sampling the original left view to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution followed by a normalization layer is then applied to f_g' to reduce the number of channels to 1, giving the final guide map G;
Step 3, cost space construction: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation rather than by direct concatenation or a mixture of the two; the group-wise correlation is computed as:
C_gwc(d, x, y, g) = (N_g / N_c) · ⟨ f_l^g(x, y), f_r^g(x - d, y) ⟩
where d is the disparity of the cost space, g indexes the groups into which the feature channels are divided, (x, y) are pixel coordinates, ⟨·,·⟩ denotes the inner product, N_c is the number of feature channels and N_g the number of groups; the number of groups is set to 44, and the disparity range used when constructing the group-wise correlation is D = 1/8 D_max, where D_max is the preset maximum disparity;
Step 4, cost aggregation: stereo matching networks usually perform cost aggregation with several stacked hourglass structures; to improve efficiency, a single hourglass structure is used here; specifically, the hourglass structure is a U-Net-like network built from 3D convolutions in which the concatenation after skip connections is replaced by summation to reduce computation; after cost aggregation a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost up-sampling: bilateral grid learning is a learnable bilateral grid; it inherits the speed and strong edge-preserving ability of the bilateral grid, while its guide map and cost space are learnable modules, giving it strong generalization ability; specifically, the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid up-sampling (implemented with PyTorch's grid_sample function) finally produces a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
Step 6, disparity regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:
d_l = Σ_{k=0}^{K-1} k · softmax_k( -C_H(k) )
where k is the disparity level and the number of levels K is set to 1/2 D_max;
Step 7, full-resolution disparity refinement: to avoid checkerboard artifacts, d_l is first bilinearly interpolated before refinement to obtain the full-resolution disparity d_h; full-resolution refinement is then applied to d_h; specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass-shaped structure and progressively refined to obtain the final predicted disparity d_final; in the training phase, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the prediction of the model;
Step 8, loss calculation: in the training phase, the loss is calculated after disparity prediction is completed; the loss is built on the Smooth L1 loss, which is relatively robust to outliers and abnormal values and has relatively small gradient changes; its formula is:
SmoothL1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise;
the overall loss function consists of three parts:
The loss function Loss_gt computes the Smooth L1 loss between the predicted value and the true value; the specific formula is:
Loss_gt = (1/|P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) - d_gt(p) )
where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The true disparity is provided by the Scene Flow, KITTI 2012 and KITTI 2015 data sets commonly used by deep-learning stereo matching methods; the Scene Flow data set is a large-scale synthetic data set providing dense true disparity, while the KITTI 2012 and KITTI 2015 data sets contain real street views and provide sparse true disparity;
When the true disparity provided by the data set is sparse, the total loss additionally includes the loss function Loss_pseudo, which computes the Smooth L1 loss between the predicted value and a pseudo disparity; the pseudo disparity is predicted by a mature model, and the formula is:
Loss_pseudo = (1/|P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) - d_pseudo(p) )
where P_pseudo is the set of points for which no true label is given and whose disparity is predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
The last term is the edge loss Loss_edges; to make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2;
The edge loss Loss_edges is computed as:
Loss_edges = (1/|P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) - d_E(p) )
where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p in the edge region;
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:
Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges
where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
Step 9, parameter update: the Adam optimizer is used to update the parameters, and multi-step decay is used for the learning rate;
The data set is divided into a training set, a validation set and a test set; the training set and the validation set are used during training, and the test set is used to evaluate the quality of the disparity predicted by the final model;
Steps 1 to 9 are repeated until the model converges, at which point training is finished; the model obtained on the validation set is the final stereo matching model used for the stereo matching task.
Unlike other methods, this technical solution fully considers both inference speed and disparity prediction in edge regions, with the following innovations: convolution operations in a convolutional neural network blur edges, so to reduce the influence of blurring and retain rich detail information, the guide map for bilateral grid learning is constructed in the feature extraction stage by extracting shallow features and fusing them with the original image; to keep the network real-time while improving accuracy, only group-wise correlation is used when constructing the cost space, bilateral grid learning is adopted to perform 3D cost aggregation at low resolution, and loss functions that add no extra inference time are used; the disparity of edge regions is among the hardest to predict in the whole field of view, so the prediction errors of deep-learning-based stereo matching networks are mostly concentrated in edge regions, and the edge loss Loss_edges is therefore introduced to focus the loss function on these hard-to-predict edge regions.
The specific effect of this technical solution is as follows: by optimizing disparity prediction in edge regions with bilateral grid learning, which has strong edge-preserving ability, and with the edge loss, the endpoint error of edge regions is significantly reduced while the endpoint error of flat regions is also reduced, and real-time inference can be performed.
Drawings
Fig. 1 is a schematic diagram of a stereo matching network architecture according to an embodiment;
FIG. 2 is a flowchart of a stereo matching method according to an embodiment;
fig. 3 is a schematic diagram of a model training process according to an embodiment.
Detailed Description
The invention will be described in further detail with reference to the following drawings and specific examples, but the invention is not limited thereto.
Embodiment:
referring to fig. 2, a binocular stereo matching method based on bilateral mesh learning and edge loss includes the following steps:
Step 1, feature extraction: input RGB images I_l and I_r of shape H × W × 3 into the network, where I_l and I_r are the left view and the right view respectively; before the data are fed into the model, the left and right views are processed as follows: stereo rectification, random cropping, random color adjustment (contrast, gamma, brightness, hue, saturation), random vertical flipping, edge-map creation, normalization to [0, 1.0], and regularization with the ImageNet mean and variance, where random cropping, random color adjustment and random vertical flipping are each applied with probability 1/2; the preprocessed left and right views are input into the stereo matching network shown in Fig. 1; a residual structure is used as the feature extraction module, with shared weights; in the shallow part of the module, four convolutional layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolutional layers that alternately raise and lower the dimensionality, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
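As an illustration only, the following sketch shows the photometric part of this preprocessing; the jitter strengths and the use of torchvision transforms are assumptions, and gamma adjustment, cropping, flipping and edge-map creation are omitted.

```python
import torch
import torchvision.transforms as T

# Hypothetical photometric preprocessing for one rectified view (jitter strengths assumed).
color_jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
to_tensor = T.ToTensor()                                # scales pixel values to [0, 1.0]
normalize = T.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet mean
                        std=[0.229, 0.224, 0.225])      # ImageNet standard deviation

def preprocess(pil_img):
    """Color-jitter with probability 1/2, then convert to a normalized tensor."""
    if torch.rand(1).item() < 0.5:
        pil_img = color_jitter(pil_img)
    return normalize(to_tensor(pil_img))                # (3, H, W) float tensor
```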
Step 2, guide map construction: to preserve detail information, before constructing the guide map an image I_1/2, obtained by down-sampling the original left view to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution followed by a normalization layer is then applied to f_g' to reduce the number of channels to 1, giving the final guide map G;
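A minimal sketch of the guide map construction follows; the exact fusion of I_1/2 with f_g is not fully specified above, so the concatenation followed by a 1 × 1 convolution down to 3 channels is an assumption, as is the use of batch normalization for the normalization layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuideMapHead(nn.Module):
    """Builds the guide map G from the half-resolution left image and the shallow feature f_g."""
    def __init__(self, shallow_channels=12):
        super().__init__()
        self.fuse = nn.Conv2d(shallow_channels + 3, 3, kernel_size=1)   # fusion down to 3 channels (assumed)
        self.to_guide = nn.Sequential(
            nn.Conv2d(3, 1, kernel_size=3, padding=1),   # 3x3 convolution down to 1 channel
            nn.BatchNorm2d(1))                           # normalization layer (assumed BatchNorm)

    def forward(self, left_img, f_g):
        i_half = F.interpolate(left_img, scale_factor=0.5, mode="bilinear", align_corners=False)
        f_g_prime = self.fuse(torch.cat([i_half, f_g], dim=1))   # (B, 3, H/2, W/2)
        return self.to_guide(f_g_prime)                          # (B, 1, H/2, W/2) guide map G
```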
Step 3, cost space construction: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation, computed as:
C_gwc(d, x, y, g) = (N_g / N_c) · ⟨ f_l^g(x, y), f_r^g(x - d, y) ⟩
where d is the disparity of the cost space, g indexes the groups into which the feature channels are divided, (x, y) are pixel coordinates, ⟨·,·⟩ denotes the inner product, N_c is the number of feature channels and N_g the number of groups; the number of groups is set to 44, and the disparity range used when constructing the group-wise correlation is D = 1/8 D_max, where D_max is the preset maximum disparity;
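For illustration, a PyTorch sketch of the group-wise correlation cost volume follows; the tensor layout (batch, groups, disparity, height, width) follows common stereo-matching implementations and is not prescribed here.

```python
import torch

def groupwise_correlation(f_l, f_r, num_groups):
    """Inner product between grouped left/right features, averaged within each group."""
    B, C, H, W = f_l.shape
    ch_per_group = C // num_groups
    corr = (f_l * f_r).view(B, num_groups, ch_per_group, H, W).mean(dim=2)
    return corr                                                   # (B, num_groups, H, W)

def build_gwc_volume(f_l, f_r, max_disp, num_groups=44):
    """Group-wise correlation cost volume at 1/8 resolution (max_disp = D_max / 8)."""
    B, C, H, W = f_l.shape
    volume = f_l.new_zeros(B, num_groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = groupwise_correlation(f_l, f_r, num_groups)
        else:                                                     # shift the right feature by d pixels
            volume[:, :, d, :, d:] = groupwise_correlation(
                f_l[:, :, :, d:], f_r[:, :, :, :-d], num_groups)
    return volume                                                 # (B, num_groups, max_disp, H, W)
```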
Step 4, cost aggregation: stereo matching networks usually perform cost aggregation with several stacked hourglass structures; to improve efficiency, a single hourglass structure is used here; specifically, the hourglass structure is a U-Net-like network built from 3D convolutions in which the concatenation after skip connections is replaced by summation to reduce computation; after cost aggregation a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
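A minimal sketch of a 3D hourglass with summation skip connections is shown below; the channel widths and the use of trilinear interpolation for the up-sampling path are assumptions, not details fixed by this description.

```python
import torch.nn as nn
import torch.nn.functional as F

class Hourglass3D(nn.Module):
    """U-Net-like 3D aggregation with summation skip connections (channel widths assumed)."""
    def __init__(self, ch=16):
        super().__init__()
        def block(cin, cout, stride=1):
            return nn.Sequential(nn.Conv3d(cin, cout, 3, stride, 1),
                                 nn.BatchNorm3d(cout), nn.ReLU(inplace=True))
        self.down1 = block(ch, ch * 2, stride=2)
        self.down2 = block(ch * 2, ch * 4, stride=2)
        self.up1 = block(ch * 4, ch * 2)
        self.up2 = block(ch * 2, ch)

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(d1)
        # summation (not concatenation) after the skip connections, to reduce computation
        u1 = self.up1(F.interpolate(d2, size=d1.shape[2:], mode="trilinear", align_corners=False)) + d1
        u2 = self.up2(F.interpolate(u1, size=x.shape[2:], mode="trilinear", align_corners=False)) + x
        return u2
```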
Step 5, cost up-sampling: bilateral grid learning is a learnable bilateral grid; it inherits the speed and strong edge-preserving ability of the bilateral grid, while its guide map and cost space are learnable modules, giving it strong generalization ability; specifically, the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid up-sampling (implemented with PyTorch's grid_sample function) finally produces a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
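The slicing step of the bilateral grid up-sampling can be sketched with grid_sample as follows; here the low-resolution cost volume is assumed to have been rearranged into a grid with an explicit guidance dimension, which simplifies the layout described above (16 channels × 25 disparity levels), so the sketch is illustrative rather than exact.

```python
import torch
import torch.nn.functional as F

def bilateral_grid_slice(grid_cost, guide):
    """Slice a low-resolution cost grid with a high-resolution guide map.
    grid_cost: (B, D, Gbins, H/8, W/8), cost arranged with a guidance dimension (assumed layout)
    guide:     (B, 1, H/2, W/2) guide map, values assumed to lie in [0, 1]"""
    B, D, Gbins, Hl, Wl = grid_cost.shape
    _, _, Hh, Wh = guide.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, Hh, device=guide.device),
                            torch.linspace(-1, 1, Wh, device=guide.device), indexing="ij")
    xs = xs.expand(B, -1, -1)                      # normalized x of each output pixel
    ys = ys.expand(B, -1, -1)                      # normalized y of each output pixel
    zs = guide.squeeze(1) * 2 - 1                  # guide value selects the slice along the guidance axis
    sample = torch.stack((xs, ys, zs), dim=-1).unsqueeze(1)        # (B, 1, H/2, W/2, 3)
    out = F.grid_sample(grid_cost, sample, mode="bilinear", align_corners=True)
    return out.squeeze(2)                          # (B, D, H/2, W/2) high-resolution cost
```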
step 6, parallax regression: a high resolution cost space C is obtained H Then, the parallax d is regressed by soft argmin to obtain the predicted parallax l Shape 1/2H × 1/2W:
d_l = Σ_{k=0}^{K-1} k · softmax_k( -C_H(k) )
where k is the disparity level and the number of levels K is set to 1/2 D_max;
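The soft argmin regression can be written in a few lines of PyTorch; the cost-volume layout (batch, disparity, height, width) is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost, max_disp):
    """Sub-pixel disparity regression: probability-weighted sum of disparity levels.
    cost: (B, D, H, W) aggregated matching cost, lower is better; max_disp must equal D."""
    prob = F.softmax(-cost, dim=1)                                  # softmax over the negated cost
    levels = torch.arange(max_disp, device=cost.device, dtype=cost.dtype).view(1, -1, 1, 1)
    return (prob * levels).sum(dim=1)                               # (B, H, W) predicted disparity
```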
Step 7, full-resolution disparity refinement: to avoid checkerboard artifacts, d_l is first bilinearly interpolated before refinement to obtain the full-resolution disparity d_h; full-resolution refinement is then applied to d_h; specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass-shaped structure and progressively refined to obtain the final predicted disparity d_final; in the training phase, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the prediction of the model;
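For the warp operation used to build the refinement input, a common implementation samples the right view at x - d for every left-view pixel; the grid_sample-based sketch below is one such possibility and not necessarily the exact operation used here.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_r, disp):
    """Sample the right view at (x - d, y) for every left-view pixel.
    img_r: (B, 3, H, W), disp: (B, H, W) disparity predicted for the left view."""
    B, _, H, W = img_r.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img_r.device, dtype=img_r.dtype),
                            torch.arange(W, device=img_r.device, dtype=img_r.dtype), indexing="ij")
    xs = xs.unsqueeze(0) - disp                    # horizontal shift by the disparity
    ys = ys.unsqueeze(0).expand_as(xs)
    grid = torch.stack((2 * xs / (W - 1) - 1,      # normalize coordinates to [-1, 1] for grid_sample
                        2 * ys / (H - 1) - 1), dim=-1)
    return F.grid_sample(img_r, grid, mode="bilinear", padding_mode="zeros", align_corners=True)
```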
Step 8, loss calculation: in the training phase, the loss is calculated after disparity prediction is completed; the loss is built on the Smooth L1 loss, which is relatively robust to outliers and abnormal values and has relatively small gradient changes; its formula is:
SmoothL1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise;
the overall loss function consists of three parts:
The loss function Loss_gt computes the Smooth L1 loss between the predicted value and the true value; the specific formula is:
Loss_gt = (1/|P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) - d_gt(p) )
where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The true disparity is provided by the Scene Flow, KITTI 2012 and KITTI 2015 data sets commonly used by deep-learning stereo matching methods; the Scene Flow data set is a large-scale synthetic data set providing dense true disparity, while the KITTI 2012 and KITTI 2015 data sets contain real street views and provide sparse true disparity;
When the true disparity provided by the data set is sparse, the total loss additionally includes the loss function Loss_pseudo, which computes the Smooth L1 loss between the predicted value and a pseudo disparity; the pseudo disparity is predicted by a mature model, and the formula is:
Loss_pseudo = (1/|P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) - d_pseudo(p) )
where P_pseudo is the set of points for which no true label is given and whose disparity is predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
The last term is the edge loss Loss_edges; to make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2;
The edge loss Loss_edges is computed as:
Loss_edges = (1/|P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) - d_E(p) )
where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p in the edge region;
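A sketch of the edge-region construction with OpenCV follows; the Canny thresholds are assumptions, while the 5 × 5 rectangular structuring element matches the description above.

```python
import cv2
import numpy as np

def build_edge_region(left_img_bgr):
    """Edge region P_E: Canny edges of the left view dilated by a 5 x 5 rectangle."""
    gray = cv2.cvtColor(left_img_bgr, cv2.COLOR_BGR2GRAY)
    e1 = cv2.Canny(gray, 50, 150)                        # edge map E_1 (thresholds assumed)
    kernel = np.ones((5, 5), np.uint8)                   # 5 x 5 rectangular structuring element
    e2 = cv2.dilate(e1, kernel, iterations=1)            # dilated edge map E_2
    return e2 > 0                                        # boolean mask of the edge region
```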
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:
Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges
where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
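Putting the three terms together, a masked Smooth L1 implementation could look like the following sketch; the mask conventions (valid_gt for labelled pixels, its complement for pseudo-labelled pixels) are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def masked_smooth_l1(pred, target, mask):
    """Smooth L1 loss averaged over the pixels selected by a boolean mask."""
    if mask.sum() == 0:
        return pred.new_zeros(())
    return F.smooth_l1_loss(pred[mask], target[mask])

def total_loss(pred, d_gt, valid_gt, d_pseudo, edge_mask, w=(1.0, 1.0, 1.0)):
    """Loss_total = w1*Loss_gt + w2*Loss_pseudo + w3*Loss_edges for one predicted disparity map."""
    loss_gt = masked_smooth_l1(pred, d_gt, valid_gt)                 # labelled pixels
    loss_pseudo = masked_smooth_l1(pred, d_pseudo, ~valid_gt)        # pixels without a true label
    loss_edges = masked_smooth_l1(pred, d_gt, valid_gt & edge_mask)  # labelled pixels inside the edge region
    return w[0] * loss_gt + w[1] * loss_pseudo + w[2] * loss_edges

# intermediate and final predictions are supervised separately and combined for back-propagation:
# loss = 0.8 * total_loss(d_h, ...) + 1.0 * total_loss(d_final, ...)
```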
Step 9, parameter update: the model training process is shown in Fig. 3; in this embodiment the Adam optimizer (β_1 = 0.9, β_2 = 0.999) is used, the learning rate is updated with multi-step decay (MultiStepLR), the milestones are set to (20, 30, 40, 50, 60, 70), and the decay factor at each milestone is 0.1;
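The optimizer and schedule described above can be set up as follows; model, train_loader, train_one_epoch, the epoch count and the initial learning rate are placeholders not specified in this description.

```python
import torch

# model, train_one_epoch, train_loader and the initial learning rate are placeholders.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 30, 40, 50, 60, 70], gamma=0.1)   # decay by 0.1 at each milestone epoch

for epoch in range(80):                                          # total epoch count assumed
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```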
The data set is divided into a training set, a validation set and a test set; the training set and the validation set are used during training, and the test set is used to evaluate the quality of the disparity predicted by the final model;
Steps 1 to 9 are repeated until the model converges, at which point training is finished; the model obtained on the validation set is the final stereo matching model used for the stereo matching task.

Claims (1)

1. A binocular stereo matching method based on bilateral grid learning and edge loss, characterized by comprising the following steps:
Step 1, feature extraction: input RGB images I_l and I_r of shape H × W × 3 into the network, where I_l and I_r are the left view and the right view respectively; a residual structure is used as the feature extraction module, with the weights of the feature extraction module shared between the two views; in the shallow part of the module, four convolutional layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolutional layers that alternately raise and lower the dimensionality, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, guide map construction: to preserve detail information, before constructing the guide map an image I_1/2, obtained by down-sampling the original left view to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution followed by a normalization layer is then applied to f_g' to reduce the number of channels to 1, giving the final guide map G;
Step 3, cost space construction: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation, computed as:
C_gwc(d, x, y, g) = (N_g / N_c) · ⟨ f_l^g(x, y), f_r^g(x - d, y) ⟩
where d is the disparity of the cost space, g indexes the groups into which the feature channels are divided, (x, y) are pixel coordinates, ⟨·,·⟩ denotes the inner product, N_c is the number of feature channels and N_g the number of groups; the number of groups is set to 44, and the disparity range used when constructing the group-wise correlation is D = 1/8 D_max, where D_max is the preset maximum disparity;
Step 4, cost aggregation: stereo matching networks usually perform cost aggregation with several stacked hourglass structures; to improve efficiency, a single hourglass structure is used here; specifically, the hourglass structure is a U-Net-like network built from 3D convolutions in which the concatenation after skip connections is replaced by summation to reduce computation; after cost aggregation a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost up-sampling: the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid up-sampling finally produces a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
Step 6, disparity regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:
d_l = Σ_{k=0}^{K-1} k · softmax_k( -C_H(k) )
where k is the disparity level and the number of levels K is set to 1/2 D_max;
Step 7, full-resolution disparity refinement: to avoid checkerboard artifacts, d_l is first bilinearly interpolated before refinement to obtain the full-resolution disparity d_h; full-resolution refinement is then applied to d_h; specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass-shaped structure and progressively refined to obtain the final predicted disparity d_final; in the training phase, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the prediction of the model;
Step 8, loss calculation: in the training phase, the loss is calculated after disparity prediction is completed; the loss is built on the Smooth L1 loss, which is relatively robust to outliers and abnormal values and has relatively small gradient changes; its formula is:
SmoothL1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise;
the overall loss function consists of three parts:
The loss function Loss_gt computes the Smooth L1 loss between the predicted value and the true value; the specific formula is:
Loss_gt = (1/|P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) - d_gt(p) )
where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The true disparity is provided by the Scene Flow, KITTI 2012 and KITTI 2015 data sets commonly used by deep-learning stereo matching methods; when the true disparity provided by the data set is sparse, the total loss additionally includes the loss function Loss_pseudo, which computes the Smooth L1 loss between the predicted value and a pseudo disparity; the pseudo disparity is predicted by a mature model, and the formula is:
Loss_pseudo = (1/|P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) - d_pseudo(p) )
where P_pseudo is the set of points for which no true label is given and whose disparity is predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
The last term is the edge loss Loss_edges; to make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2;
The edge loss Loss_edges is computed as:
Loss_edges = (1/|P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) - d_E(p) )
where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p in the edge region;
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:
Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges
where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
and 9, updating parameters: updating parameters by using an Adam optimizer, and using multi-step attenuation for learning rate attenuation;
The data set is divided into a training set, a validation set and a test set; the training set and the validation set are used during training, and the test set is used to evaluate the quality of the disparity predicted by the final model;
Steps 1 to 9 are repeated until the model converges, at which point training is finished; the model obtained on the validation set is the final stereo matching model used for the stereo matching task.
CN202210794705.XA 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss Withdrawn CN115170921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210794705.XA CN115170921A (en) 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210794705.XA CN115170921A (en) 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss

Publications (1)

Publication Number Publication Date
CN115170921A (en) 2022-10-11

Family

ID=83490332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210794705.XA Withdrawn CN115170921A (en) 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss

Country Status (1)

Country Link
CN (1) CN115170921A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128946A (en) * 2022-12-09 2023-05-16 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism
CN116128946B (en) * 2022-12-09 2024-02-09 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Similar Documents

Publication Publication Date Title
CN110490919B (en) Monocular vision depth estimation method based on deep neural network
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN108376392B (en) Image motion blur removing method based on convolutional neural network
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN111739077A (en) Monocular underwater image depth estimation and color correction method based on depth neural network
CN108765479A (en) Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN110070489A (en) Binocular image super-resolution method based on parallax attention mechanism
CN113592026A (en) Binocular vision stereo matching method based on void volume and cascade cost volume
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN109949354B (en) Light field depth information estimation method based on full convolution neural network
CN111508013A (en) Stereo matching method
CN112422870B (en) Deep learning video frame insertion method based on knowledge distillation
CN115393410A (en) Monocular view depth estimation method based on nerve radiation field and semantic segmentation
CN112580473A (en) Motion feature fused video super-resolution reconstruction method
CN116958534A (en) Image processing method, training method of image processing model and related device
CN114638842B (en) Medical image segmentation method based on MLP
Xu et al. AutoSegNet: An automated neural network for image segmentation
CN115239564A (en) Mine image super-resolution reconstruction method combining semantic information
CN115511708A (en) Depth map super-resolution method and system based on uncertainty perception feature transmission
CN116563459A (en) Text-driven immersive open scene neural rendering and mixing enhancement method
CN115170921A (en) Binocular stereo matching method based on bilateral grid learning and edge loss
CN112116646B (en) Depth estimation method for light field image based on depth convolution neural network
CN112785502B (en) Light field image super-resolution method of hybrid camera based on texture migration
CN104143203A (en) Image editing and communication method
CN115578260A (en) Attention method and system for direction decoupling for image super-resolution

Legal Events

Code / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 2022-10-11)