CN115170921A - Binocular stereo matching method based on bilateral grid learning and edge loss - Google Patents

Binocular stereo matching method based on bilateral grid learning and edge loss Download PDF

Info

Publication number
CN115170921A
Authority
CN
China
Prior art keywords
loss
parallax
edge
cost
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210794705.XA
Other languages
Chinese (zh)
Inventor
陈明
闭韦杰
张正钦
容仕军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202210794705.XA
Publication of CN115170921A
Legal status: Withdrawn (current)

Classifications

    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 3/4007: Interpolation-based scaling, e.g. bilinear interpolation
    • G06T 3/4038: Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 2200/32: Indexing scheme for image data processing or generation, in general, involving image mosaicing

Abstract

The invention discloses a binocular stereo matching method based on bilateral grid learning and edge loss, which comprises the following steps: step 1, feature extraction; step 2, guide map construction; step 3, cost space construction; step 4, cost aggregation; step 5, cost up-sampling; step 6, disparity regression; step 7, full-resolution disparity refinement; step 8, loss calculation; and step 9, parameter update. The method achieves high accuracy in disparity estimation of edge regions, reduces errors, and supports real-time inference.

Description

Binocular stereo matching method based on bilateral grid learning and edge loss
Technical Field
The invention relates to the field of three-dimensional reconstruction and stereoscopic vision, in particular to a binocular stereo matching method based on bilateral grid learning and edge loss.
Background
With the development of science and technology, two-dimensional image information increasingly fails to meet application requirements in areas such as autonomous driving, robot navigation, face recognition and reverse engineering. During two-dimensional image acquisition, depth, an important scene cue, is lost, so a machine cannot fully understand the real scene. To acquire depth information, technologies such as stereoscopic vision, time of flight (ToF) and structured light have become research hotspots.
Compared with other technologies, stereoscopic vision offers low cost, high efficiency and strong adaptability, and remains one of the key technologies in three-dimensional reconstruction. Stereo vision that uses only two cameras (or two images) for reconstruction is commonly called binocular stereo vision.
Binocular stereo matching is a key step in binocular stereo vision, and the matching result largely determines the quality of the three-dimensional reconstruction. Compared with traditional methods built on hand-crafted features, deep learning offers researchers a new solution: performing stereo matching with the strong learning capability of deep networks. In recent years, deep-learning-based algorithms have obtained more effective and robust features and have gradually surpassed traditional methods in accuracy.
Among the disclosed technologies, publication No. CN 112150521A, a binocular vision stereo matching method based on dilated (atrous) convolution and cascaded cost volumes, and publication No. CN 111833386A, a pyramid binocular stereo matching method based on multi-scale information and an attention mechanism, among others, often adopt dilated convolutions and pyramid-like structures when constructing the neural network. Both choices tend to make the network produce gridding effects and blurred edges. Because of these defects, such networks usually achieve a relatively low endpoint error (EPE) in flat regions but a high endpoint error in edge regions. BGNet (Xu B, Xu Y, Yang X, et al. Bilateral grid learning for stereo matching networks[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 12497-12506.), based on bilateral grid up-sampling, was the first work to introduce bilateral grid learning into stereo matching; with its bilateral-grid cost up-sampling module, the model performs cost aggregation at extremely low resolution and then up-samples to high resolution, which greatly accelerates inference and gives a certain edge-preserving ability. However, its single-channel image input discards too much information, which lowers the quality of the extracted features, and its final predicted disparity comes from bilinear up-sampling, which introduces checkerboard artifacts. As a result, the network's final endpoint error on the Scene Flow test set is 1.17, a relatively low accuracy.
Disclosure of Invention
The invention aims to provide, in view of the deficiencies of the prior art, a binocular stereo matching method based on bilateral grid learning and edge loss. The method achieves high accuracy in disparity estimation of edge regions, reduces errors, and supports real-time inference.
The technical solution that achieves this aim is as follows:
A binocular stereo matching method based on bilateral grid learning and edge loss comprises the following steps:
Step 1, feature extraction: input RGB images I_l and I_r of shape H × W × 3 into the network, where I_l and I_r are the left view and the right view respectively; a residual structure is used as the feature extraction module, with the weights of the feature extraction module shared between the two views; in the shallow part of the module, four convolutional layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolutional layers that alternately raise and lower the dimensionality, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, guide map construction: to preserve detail information, before constructing the guide map an image I_1/2, obtained by down-sampling the original left view to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution followed by a normalization layer is then applied to f_g' to reduce the number of channels to 1, giving the final guide map G;
Step 3, cost space construction: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation rather than by direct concatenation or a mixture of the two; the group-wise correlation is computed as:
C_gwc(d, x, y, g) = (N_g / N_c) · ⟨ f_l^g(x, y), f_r^g(x - d, y) ⟩
where d is the disparity of the cost space, g indexes the groups into which the feature channels are divided, (x, y) are pixel coordinates, ⟨·,·⟩ denotes the inner product, N_c is the number of feature channels and N_g the number of groups; the number of groups is set to 44, and the disparity range used when constructing the group-wise correlation is D = 1/8 D_max, where D_max is the preset maximum disparity;
Step 4, cost aggregation: stereo matching networks usually perform cost aggregation with several stacked hourglass structures; to improve efficiency, a single hourglass structure is used here; specifically, the hourglass structure is a U-Net-like network built from 3D convolutions in which the concatenation after skip connections is replaced by summation to reduce computation; after cost aggregation a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost up-sampling: bilateral grid learning is a learnable bilateral grid; it inherits the speed and strong edge-preserving ability of the bilateral grid, while its guide map and cost space are learnable modules, giving it strong generalization ability; specifically, the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid up-sampling (implemented with PyTorch's grid_sample function) finally produces a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
Step 6, disparity regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:
d_l = Σ_{k=0}^{K-1} k · softmax_k( -C_H(k) )
where k is the disparity level and the number of levels K is set to 1/2 D_max;
Step 7, full-resolution disparity refinement: to avoid checkerboard artifacts, d_l is first bilinearly interpolated before refinement to obtain the full-resolution disparity d_h; full-resolution refinement is then applied to d_h; specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass-shaped structure and progressively refined to obtain the final predicted disparity d_final; in the training phase, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the prediction of the model;
Step 8, loss calculation: in the training phase, the loss is calculated after disparity prediction is completed; the loss is built on the Smooth L1 loss, which is relatively robust to outliers and abnormal values and has relatively small gradient changes; its formula is:
SmoothL1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise;
the overall loss function consists of three parts:
The loss function Loss_gt computes the Smooth L1 loss between the predicted value and the true value; the specific formula is:
Loss_gt = (1/|P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) - d_gt(p) )
where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The true disparity is provided by the Scene Flow, KITTI 2012 and KITTI 2015 data sets commonly used by deep-learning stereo matching methods; the Scene Flow data set is a large-scale synthetic data set providing dense true disparity, while the KITTI 2012 and KITTI 2015 data sets contain real street views and provide sparse true disparity;
When the true disparity provided by the data set is sparse, the total loss additionally includes the loss function Loss_pseudo, which computes the Smooth L1 loss between the predicted value and a pseudo disparity; the pseudo disparity is predicted by a mature model, and the formula is:
Loss_pseudo = (1/|P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) - d_pseudo(p) )
where P_pseudo is the set of points for which no true label is given and whose disparity is predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
The last term is the edge loss Loss_edges; to make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2;
The edge loss Loss_edges is computed as:
Loss_edges = (1/|P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) - d_E(p) )
where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p in the edge region;
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:
Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges
where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
Step 9, parameter update: the Adam optimizer is used to update the parameters, and multi-step decay is used for the learning rate;
The data set is divided into a training set, a validation set and a test set; the training set and the validation set are used during training, and the test set is used to evaluate the quality of the disparity predicted by the final model;
Steps 1 to 9 are repeated until the model converges, at which point training is finished; the model obtained on the validation set is the final stereo matching model used for the stereo matching task.
Unlike other methods, this technical solution fully considers both inference speed and disparity prediction in edge regions, with the following innovations: convolution operations in a convolutional neural network blur edges, so to reduce the influence of blurring and retain rich detail information, the guide map for bilateral grid learning is constructed in the feature extraction stage by extracting shallow features and fusing them with the original image; to keep the network real-time while improving accuracy, only group-wise correlation is used when constructing the cost space, bilateral grid learning is adopted to perform 3D cost aggregation at low resolution, and loss functions that add no extra inference time are used; the disparity of edge regions is among the hardest to predict in the whole field of view, so the prediction errors of deep-learning-based stereo matching networks are mostly concentrated in edge regions, and the edge loss Loss_edges is therefore introduced to focus the loss function on these hard-to-predict edge regions.
The specific effect of this technical solution is as follows: by optimizing disparity prediction in edge regions with bilateral grid learning, which has strong edge-preserving ability, and with the edge loss, the endpoint error of edge regions is significantly reduced while the endpoint error of flat regions is also reduced, and real-time inference can be performed.
Drawings
Fig. 1 is a schematic diagram of a stereo matching network architecture according to an embodiment;
FIG. 2 is a flowchart of a stereo matching method according to an embodiment;
fig. 3 is a schematic diagram of a model training process according to an embodiment.
Detailed Description
The invention will be described in further detail with reference to the following drawings and specific examples, but the invention is not limited thereto.
Embodiment:
referring to fig. 2, a binocular stereo matching method based on bilateral mesh learning and edge loss includes the following steps:
Step 1, feature extraction: input RGB images I_l and I_r of shape H × W × 3 into the network, where I_l and I_r are the left view and the right view respectively; before the data are fed into the model, the left and right views are processed as follows: stereo rectification, random cropping, random color adjustment (contrast, gamma, brightness, hue, saturation), random vertical flipping, edge-map creation, normalization to [0, 1.0], and regularization with the ImageNet mean and variance, where random cropping, random color adjustment and random vertical flipping are each applied with probability 1/2; the preprocessed left and right views are input into the stereo matching network shown in Fig. 1; a residual structure is used as the feature extraction module, with shared weights; in the shallow part of the module, four convolutional layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolutional layers that alternately raise and lower the dimensionality, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
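As an illustration only, the following sketch shows the photometric part of this preprocessing; the jitter strengths and the use of torchvision transforms are assumptions, and gamma adjustment, cropping, flipping and edge-map creation are omitted.

```python
import torch
import torchvision.transforms as T

# Hypothetical photometric preprocessing for one rectified view (jitter strengths assumed).
color_jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
to_tensor = T.ToTensor()                                # scales pixel values to [0, 1.0]
normalize = T.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet mean
                        std=[0.229, 0.224, 0.225])      # ImageNet standard deviation

def preprocess(pil_img):
    """Color-jitter with probability 1/2, then convert to a normalized tensor."""
    if torch.rand(1).item() < 0.5:
        pil_img = color_jitter(pil_img)
    return normalize(to_tensor(pil_img))                # (3, H, W) float tensor
```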
Step 2, guide map construction: to preserve detail information, before constructing the guide map an image I_1/2, obtained by down-sampling the original left view to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution followed by a normalization layer is then applied to f_g' to reduce the number of channels to 1, giving the final guide map G;
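A minimal sketch of the guide map construction follows; the exact fusion of I_1/2 with f_g is not fully specified above, so the concatenation followed by a 1 × 1 convolution down to 3 channels is an assumption, as is the use of batch normalization for the normalization layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuideMapHead(nn.Module):
    """Builds the guide map G from the half-resolution left image and the shallow feature f_g."""
    def __init__(self, shallow_channels=12):
        super().__init__()
        self.fuse = nn.Conv2d(shallow_channels + 3, 3, kernel_size=1)   # fusion down to 3 channels (assumed)
        self.to_guide = nn.Sequential(
            nn.Conv2d(3, 1, kernel_size=3, padding=1),   # 3x3 convolution down to 1 channel
            nn.BatchNorm2d(1))                           # normalization layer (assumed BatchNorm)

    def forward(self, left_img, f_g):
        i_half = F.interpolate(left_img, scale_factor=0.5, mode="bilinear", align_corners=False)
        f_g_prime = self.fuse(torch.cat([i_half, f_g], dim=1))   # (B, 3, H/2, W/2)
        return self.to_guide(f_g_prime)                          # (B, 1, H/2, W/2) guide map G
```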
Step 3, cost space construction: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation, computed as:
C_gwc(d, x, y, g) = (N_g / N_c) · ⟨ f_l^g(x, y), f_r^g(x - d, y) ⟩
where d is the disparity of the cost space, g indexes the groups into which the feature channels are divided, (x, y) are pixel coordinates, ⟨·,·⟩ denotes the inner product, N_c is the number of feature channels and N_g the number of groups; the number of groups is set to 44, and the disparity range used when constructing the group-wise correlation is D = 1/8 D_max, where D_max is the preset maximum disparity;
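For illustration, a PyTorch sketch of the group-wise correlation cost volume follows; the tensor layout (batch, groups, disparity, height, width) follows common stereo-matching implementations and is not prescribed here.

```python
import torch

def groupwise_correlation(f_l, f_r, num_groups):
    """Inner product between grouped left/right features, averaged within each group."""
    B, C, H, W = f_l.shape
    ch_per_group = C // num_groups
    corr = (f_l * f_r).view(B, num_groups, ch_per_group, H, W).mean(dim=2)
    return corr                                                   # (B, num_groups, H, W)

def build_gwc_volume(f_l, f_r, max_disp, num_groups=44):
    """Group-wise correlation cost volume at 1/8 resolution (max_disp = D_max / 8)."""
    B, C, H, W = f_l.shape
    volume = f_l.new_zeros(B, num_groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = groupwise_correlation(f_l, f_r, num_groups)
        else:                                                     # shift the right feature by d pixels
            volume[:, :, d, :, d:] = groupwise_correlation(
                f_l[:, :, :, d:], f_r[:, :, :, :-d], num_groups)
    return volume                                                 # (B, num_groups, max_disp, H, W)
```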
Step 4, cost aggregation: stereo matching networks usually perform cost aggregation with several stacked hourglass structures; to improve efficiency, a single hourglass structure is used here; specifically, the hourglass structure is a U-Net-like network built from 3D convolutions in which the concatenation after skip connections is replaced by summation to reduce computation; after cost aggregation a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
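A minimal sketch of a 3D hourglass with summation skip connections is shown below; the channel widths and the use of trilinear interpolation for the up-sampling path are assumptions, not details fixed by this description.

```python
import torch.nn as nn
import torch.nn.functional as F

class Hourglass3D(nn.Module):
    """U-Net-like 3D aggregation with summation skip connections (channel widths assumed)."""
    def __init__(self, ch=16):
        super().__init__()
        def block(cin, cout, stride=1):
            return nn.Sequential(nn.Conv3d(cin, cout, 3, stride, 1),
                                 nn.BatchNorm3d(cout), nn.ReLU(inplace=True))
        self.down1 = block(ch, ch * 2, stride=2)
        self.down2 = block(ch * 2, ch * 4, stride=2)
        self.up1 = block(ch * 4, ch * 2)
        self.up2 = block(ch * 2, ch)

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(d1)
        # summation (not concatenation) after the skip connections, to reduce computation
        u1 = self.up1(F.interpolate(d2, size=d1.shape[2:], mode="trilinear", align_corners=False)) + d1
        u2 = self.up2(F.interpolate(u1, size=x.shape[2:], mode="trilinear", align_corners=False)) + x
        return u2
```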
Step 5, cost up-sampling: bilateral grid learning is a learnable bilateral grid; it inherits the speed and strong edge-preserving ability of the bilateral grid, while its guide map and cost space are learnable modules, giving it strong generalization ability; specifically, the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid up-sampling (implemented with PyTorch's grid_sample function) finally produces a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
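The slicing step of the bilateral grid up-sampling can be sketched with grid_sample as follows; here the low-resolution cost volume is assumed to have been rearranged into a grid with an explicit guidance dimension, which simplifies the layout described above (16 channels × 25 disparity levels), so the sketch is illustrative rather than exact.

```python
import torch
import torch.nn.functional as F

def bilateral_grid_slice(grid_cost, guide):
    """Slice a low-resolution cost grid with a high-resolution guide map.
    grid_cost: (B, D, Gbins, H/8, W/8), cost arranged with a guidance dimension (assumed layout)
    guide:     (B, 1, H/2, W/2) guide map, values assumed to lie in [0, 1]"""
    B, D, Gbins, Hl, Wl = grid_cost.shape
    _, _, Hh, Wh = guide.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, Hh, device=guide.device),
                            torch.linspace(-1, 1, Wh, device=guide.device), indexing="ij")
    xs = xs.expand(B, -1, -1)                      # normalized x of each output pixel
    ys = ys.expand(B, -1, -1)                      # normalized y of each output pixel
    zs = guide.squeeze(1) * 2 - 1                  # guide value selects the slice along the guidance axis
    sample = torch.stack((xs, ys, zs), dim=-1).unsqueeze(1)        # (B, 1, H/2, W/2, 3)
    out = F.grid_sample(grid_cost, sample, mode="bilinear", align_corners=True)
    return out.squeeze(2)                          # (B, D, H/2, W/2) high-resolution cost
```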
step 6, parallax regression: a high resolution cost space C is obtained H Then, the parallax d is regressed by soft argmin to obtain the predicted parallax l Shape 1/2H × 1/2W:
d_l = Σ_{k=0}^{K-1} k · softmax_k( -C_H(k) )
where k is the disparity level and the number of levels K is set to 1/2 D_max;
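The soft argmin regression can be written in a few lines of PyTorch; the cost-volume layout (batch, disparity, height, width) is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost, max_disp):
    """Sub-pixel disparity regression: probability-weighted sum of disparity levels.
    cost: (B, D, H, W) aggregated matching cost, lower is better; max_disp must equal D."""
    prob = F.softmax(-cost, dim=1)                                  # softmax over the negated cost
    levels = torch.arange(max_disp, device=cost.device, dtype=cost.dtype).view(1, -1, 1, 1)
    return (prob * levels).sum(dim=1)                               # (B, H, W) predicted disparity
```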
Step 7, full-resolution disparity refinement: to avoid checkerboard artifacts, d_l is first bilinearly interpolated before refinement to obtain the full-resolution disparity d_h; full-resolution refinement is then applied to d_h; specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass-shaped structure and progressively refined to obtain the final predicted disparity d_final; in the training phase, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the prediction of the model;
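For the warp operation used to build the refinement input, a common implementation samples the right view at x - d for every left-view pixel; the grid_sample-based sketch below is one such possibility and not necessarily the exact operation used here.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_r, disp):
    """Sample the right view at (x - d, y) for every left-view pixel.
    img_r: (B, 3, H, W), disp: (B, H, W) disparity predicted for the left view."""
    B, _, H, W = img_r.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img_r.device, dtype=img_r.dtype),
                            torch.arange(W, device=img_r.device, dtype=img_r.dtype), indexing="ij")
    xs = xs.unsqueeze(0) - disp                    # horizontal shift by the disparity
    ys = ys.unsqueeze(0).expand_as(xs)
    grid = torch.stack((2 * xs / (W - 1) - 1,      # normalize coordinates to [-1, 1] for grid_sample
                        2 * ys / (H - 1) - 1), dim=-1)
    return F.grid_sample(img_r, grid, mode="bilinear", padding_mode="zeros", align_corners=True)
```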
Step 8, loss calculation: in the training phase, the loss is calculated after disparity prediction is completed; the loss is built on the Smooth L1 loss, which is relatively robust to outliers and abnormal values and has relatively small gradient changes; its formula is:
SmoothL1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise;
the overall loss function consists of three parts:
The loss function Loss_gt computes the Smooth L1 loss between the predicted value and the true value; the specific formula is:
Loss_gt = (1/|P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) - d_gt(p) )
where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The true disparity is provided by the Scene Flow, KITTI 2012 and KITTI 2015 data sets commonly used by deep-learning stereo matching methods; the Scene Flow data set is a large-scale synthetic data set providing dense true disparity, while the KITTI 2012 and KITTI 2015 data sets contain real street views and provide sparse true disparity;
When the true disparity provided by the data set is sparse, the total loss additionally includes the loss function Loss_pseudo, which computes the Smooth L1 loss between the predicted value and a pseudo disparity; the pseudo disparity is predicted by a mature model, and the formula is:
Loss_pseudo = (1/|P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) - d_pseudo(p) )
where P_pseudo is the set of points for which no true label is given and whose disparity is predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
The last term is the edge loss Loss_edges; to make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2;
The edge loss Loss_edges is computed as:
Loss_edges = (1/|P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) - d_E(p) )
where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p in the edge region;
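A sketch of the edge-region construction with OpenCV follows; the Canny thresholds are assumptions, while the 5 × 5 rectangular structuring element matches the description above.

```python
import cv2
import numpy as np

def build_edge_region(left_img_bgr):
    """Edge region P_E: Canny edges of the left view dilated by a 5 x 5 rectangle."""
    gray = cv2.cvtColor(left_img_bgr, cv2.COLOR_BGR2GRAY)
    e1 = cv2.Canny(gray, 50, 150)                        # edge map E_1 (thresholds assumed)
    kernel = np.ones((5, 5), np.uint8)                   # 5 x 5 rectangular structuring element
    e2 = cv2.dilate(e1, kernel, iterations=1)            # dilated edge map E_2
    return e2 > 0                                        # boolean mask of the edge region
```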
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:
Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges
where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
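Putting the three terms together, a masked Smooth L1 implementation could look like the following sketch; the mask conventions (valid_gt for labelled pixels, its complement for pseudo-labelled pixels) are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def masked_smooth_l1(pred, target, mask):
    """Smooth L1 loss averaged over the pixels selected by a boolean mask."""
    if mask.sum() == 0:
        return pred.new_zeros(())
    return F.smooth_l1_loss(pred[mask], target[mask])

def total_loss(pred, d_gt, valid_gt, d_pseudo, edge_mask, w=(1.0, 1.0, 1.0)):
    """Loss_total = w1*Loss_gt + w2*Loss_pseudo + w3*Loss_edges for one predicted disparity map."""
    loss_gt = masked_smooth_l1(pred, d_gt, valid_gt)                 # labelled pixels
    loss_pseudo = masked_smooth_l1(pred, d_pseudo, ~valid_gt)        # pixels without a true label
    loss_edges = masked_smooth_l1(pred, d_gt, valid_gt & edge_mask)  # labelled pixels inside the edge region
    return w[0] * loss_gt + w[1] * loss_pseudo + w[2] * loss_edges

# intermediate and final predictions are supervised separately and combined for back-propagation:
# loss = 0.8 * total_loss(d_h, ...) + 1.0 * total_loss(d_final, ...)
```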
Step 9, parameter update: the model training process is shown in Fig. 3; in this embodiment the Adam optimizer (β_1 = 0.9, β_2 = 0.999) is used, the learning rate is updated with multi-step decay (MultiStepLR), the milestones are set to (20, 30, 40, 50, 60, 70), and the decay factor at each milestone is 0.1;
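The optimizer and schedule described above can be set up as follows; model, train_loader, train_one_epoch, the epoch count and the initial learning rate are placeholders not specified in this description.

```python
import torch

# model, train_one_epoch, train_loader and the initial learning rate are placeholders.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 30, 40, 50, 60, 70], gamma=0.1)   # decay by 0.1 at each milestone epoch

for epoch in range(80):                                          # total epoch count assumed
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```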
The data set is divided into a training set, a validation set and a test set; the training set and the validation set are used during training, and the test set is used to evaluate the quality of the disparity predicted by the final model;
Steps 1 to 9 are repeated until the model converges, at which point training is finished; the model obtained on the validation set is the final stereo matching model used for the stereo matching task.

Claims (1)

1. A binocular stereo matching method based on bilateral grid learning and edge loss, characterized by comprising the following steps:
Step 1, feature extraction: input RGB images I_l and I_r of shape H × W × 3 into the network, where I_l and I_r are the left view and the right view respectively; a residual structure is used as the feature extraction module, with the weights of the feature extraction module shared between the two views; in the shallow part of the module, four convolutional layers built from 3 × 3 convolutions extract an initial shallow feature f_g of shape 1/2H × 1/2W × 12; then, after four residual layers and four convolutional layers that alternately raise and lower the dimensionality, the intermediate features are concatenated to obtain features of shape 1/8H × 1/8W × 352; the features extracted from the left and right views are denoted f_l and f_r respectively;
Step 2, guide map construction: to preserve detail information, before constructing the guide map an image I_1/2, obtained by down-sampling the original left view to 1/2 resolution, is fused with f_g to obtain f_g' of shape 1/2H × 1/2W × 3; a 3 × 3 convolution followed by a normalization layer is then applied to f_g' to reduce the number of channels to 1, giving the final guide map G;
Step 3, cost space construction: to reduce the number of parameters and improve inference speed, the cost space is constructed by group-wise correlation, computed as:
C_gwc(d, x, y, g) = (N_g / N_c) · ⟨ f_l^g(x, y), f_r^g(x - d, y) ⟩
where d is the disparity of the cost space, g indexes the groups into which the feature channels are divided, (x, y) are pixel coordinates, ⟨·,·⟩ denotes the inner product, N_c is the number of feature channels and N_g the number of groups; the number of groups is set to 44, and the disparity range used when constructing the group-wise correlation is D = 1/8 D_max, where D_max is the preset maximum disparity;
Step 4, cost aggregation: stereo matching networks usually perform cost aggregation with several stacked hourglass structures; to improve efficiency, a single hourglass structure is used here; specifically, the hourglass structure is a U-Net-like network built from 3D convolutions in which the concatenation after skip connections is replaced by summation to reduce computation; after cost aggregation a low-resolution cost space C_l of shape 16 × 25 × 1/8H × 1/8W is obtained;
Step 5, cost up-sampling: the guide map G obtained in step 2 and the C_l obtained in step 4 are taken as the input of bilateral grid learning, and bilateral grid up-sampling finally produces a high-resolution cost space C_H of shape 16 × 25 × 1/2H × 1/2W;
Step 6, disparity regression: after the high-resolution cost space C_H is obtained, the predicted disparity d_l of shape 1/2H × 1/2W is regressed by soft argmin:
d_l = Σ_{k=0}^{K-1} k · softmax_k( -C_H(k) )
where k is the disparity level and the number of levels K is set to 1/2 D_max;
Step 7, full-resolution disparity refinement: to avoid checkerboard artifacts, d_l is first bilinearly interpolated before refinement to obtain the full-resolution disparity d_h; full-resolution refinement is then applied to d_h; specifically, the inputs d_h, I_l and I_r are combined by a warp operation and a concatenation operation, then fed into an hourglass-shaped structure and progressively refined to obtain the final predicted disparity d_final; in the training phase, after full-resolution disparity prediction, d_h serves as intermediate supervision and, together with the final predicted disparity d_final, forms the prediction of the model;
Step 8, loss calculation: in the training phase, the loss is calculated after disparity prediction is completed; the loss is built on the Smooth L1 loss, which is relatively robust to outliers and abnormal values and has relatively small gradient changes; its formula is:
SmoothL1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise;
the overall loss function consists of three parts:
The loss function Loss_gt computes the Smooth L1 loss between the predicted value and the true value; the specific formula is:
Loss_gt = (1/|P_gt|) · Σ_{p ∈ P_gt} SmoothL1( d(p) - d_gt(p) )
where P_gt is the set of points with a true disparity label, d(p) is the predicted disparity at point p, and d_gt is the true disparity at point p;
The true disparity is provided by the Scene Flow, KITTI 2012 and KITTI 2015 data sets commonly used by deep-learning stereo matching methods; when the true disparity provided by the data set is sparse, the total loss additionally includes the loss function Loss_pseudo, which computes the Smooth L1 loss between the predicted value and a pseudo disparity; the pseudo disparity is predicted by a mature model, and the formula is:
Loss_pseudo = (1/|P_pseudo|) · Σ_{p ∈ P_pseudo} SmoothL1( d(p) - d_pseudo(p) )
where P_pseudo is the set of points for which no true label is given and whose disparity is predicted with the mature model, and d_pseudo is the pseudo disparity at point p;
The last term is the edge loss Loss_edges; to make the designed network model focus on edge regions during disparity estimation, an additional Smooth L1 loss is computed over the edge region, which is constructed as follows: edge detection is performed on the left view with the Canny operator to obtain an edge map E_1; E_1 is then dilated with a 5 × 5 rectangular structuring element to obtain the final edge map E_2;
The edge loss Loss_edges is computed as:
Loss_edges = (1/|P_E|) · Σ_{p ∈ P_E} SmoothL1( d(p) - d_E(p) )
where P_E is the set of points contained in the edge region of the edge map E_2, and d_E is the true disparity at point p in the edge region;
The total loss Loss_total is the sum of Loss_gt, Loss_pseudo and Loss_edges:
Loss_total = w_1 · Loss_gt + w_2 · Loss_pseudo + w_3 · Loss_edges
where w_1, w_2 and w_3 are preset hyper-parameters, each set to 1.0;
During training, the total losses of d_h and d_final are computed separately as loss_1 and loss_2, and the value 0.8 × loss_1 + 1.0 × loss_2 is back-propagated;
and 9, updating parameters: updating parameters by using an Adam optimizer, and using multi-step attenuation for learning rate attenuation;
The data set is divided into a training set, a validation set and a test set; the training set and the validation set are used during training, and the test set is used to evaluate the quality of the disparity predicted by the final model;
Steps 1 to 9 are repeated until the model converges, at which point training is finished; the model obtained on the validation set is the final stereo matching model used for the stereo matching task.
CN202210794705.XA 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss Withdrawn CN115170921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210794705.XA CN115170921A (en) 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210794705.XA CN115170921A (en) 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss

Publications (1)

Publication Number Publication Date
CN115170921A (en) 2022-10-11

Family

ID=83490332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210794705.XA Withdrawn CN115170921A (en) 2022-07-07 2022-07-07 Binocular stereo matching method based on bilateral grid learning and edge loss

Country Status (1)

Country Link
CN (1) CN115170921A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128946A (en) * 2022-12-09 2023-05-16 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism
CN116128946B (en) * 2022-12-09 2024-02-09 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Similar Documents

Publication Publication Date Title
CN110490919B (en) Monocular vision depth estimation method based on deep neural network
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN108376392B (en) Image motion blur removing method based on convolutional neural network
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN111739077A (en) Monocular underwater image depth estimation and color correction method based on depth neural network
CN108765479A (en) Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN110070489A (en) Binocular image super-resolution method based on parallax attention mechanism
CN113592026A (en) Binocular vision stereo matching method based on void volume and cascade cost volume
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN109949354B (en) Light field depth information estimation method based on full convolution neural network
CN111508013A (en) Stereo matching method
CN112422870B (en) Deep learning video frame insertion method based on knowledge distillation
CN115393410A (en) Monocular view depth estimation method based on nerve radiation field and semantic segmentation
CN112580473A (en) Motion feature fused video super-resolution reconstruction method
CN116958534A (en) Image processing method, training method of image processing model and related device
CN114638842B (en) Medical image segmentation method based on MLP
Xu et al. AutoSegNet: An automated neural network for image segmentation
CN115239564A (en) Mine image super-resolution reconstruction method combining semantic information
CN115511708A (en) Depth map super-resolution method and system based on uncertainty perception feature transmission
CN116563459A (en) Text-driven immersive open scene neural rendering and mixing enhancement method
CN115170921A (en) Binocular stereo matching method based on bilateral grid learning and edge loss
CN112116646B (en) Depth estimation method for light field image based on depth convolution neural network
CN112785502B (en) Light field image super-resolution method of hybrid camera based on texture migration
CN104143203A (en) Image editing and communication method
CN115578260A (en) Attention method and system for direction decoupling for image super-resolution

Legal Events

Code / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 2022-10-11)