CN111709977A - Binocular deep learning method based on adaptive unimodal stereo matching cost filtering - Google Patents

Binocular deep learning method based on adaptive unimodal stereo matching cost filtering

Info

Publication number
CN111709977A
CN111709977A (application CN202010185728.1A)
Authority
CN
China
Prior art keywords
matching cost
network
stereo
unimodal
matching
Prior art date
Legal status
Pending
Application number
CN202010185728.1A
Other languages
Chinese (zh)
Inventor
百晓 (Bai Xiao)
张友敏 (Zhang Youmin)
于洋 (Yu Yang)
安冬 (An Dong)
石翔 (Shi Xiang)
Current Assignee
Qingdao Research Institute Of Beijing University Of Aeronautics And Astronautics
Goertek Robotics Co Ltd
Original Assignee
Qingdao Research Institute Of Beijing University Of Aeronautics And Astronautics
Goertek Robotics Co Ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Research Institute Of Beijing University Of Aeronautics And Astronautics, Goertek Robotics Co Ltd filed Critical Qingdao Research Institute Of Beijing University Of Aeronautics And Astronautics
Priority to CN202010185728.1A
Publication of CN111709977A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10012 Stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a binocular deep learning method based on adaptive unimodal stereo matching cost filtering, in which unimodal distribution supervision centered on the true disparity is applied directly to the matching cost predicted by the network, realizing adaptive matching cost filtering. The method comprises the following steps: 1) constructing a data set comprising left and right images, which serve as a stereo image pair; 2) taking PSMNet as the stereo matching model base network, inputting the stereo image pair into it, and having the base network output three matching cost volumes (Cost Volumes) aggregated by a stacked-hourglass 3D convolutional neural network; 3) for each matching cost volume (Cost Volume), estimating a confidence map with a confidence estimation network (Confidence Estimation Network) and adjusting the ground-truth cost volume (Ground Truth Cost Volume) to generate a pixel-level unimodal distribution (Unimodal Distribution) as the network training label. The invention overcomes the defects of the prior art with a reasonable and novel structural design.

Description

Binocular deep learning method based on adaptive unimodal stereo matching cost filtering
Technical Field
The invention relates to a binocular deep learning method based on adaptive unimodal stereo matching cost filtering, and belongs to the technical field of binocular stereo matching and visual image processing.
Background
Binocular stereo vision obtains rich three-dimensional data, especially depth information, by mimicking the principles of human vision. After many years of development, binocular stereo vision plays a great role in fields such as industrial measurement, three-dimensional reconstruction, and autonomous driving. Based on the principle of parallax, binocular stereo vision acquires two images of the object to be measured from different positions with imaging equipment, and obtains three-dimensional geometric information of the object by computing the positional deviation between corresponding image points. The binocular stereo matching process generally comprises four steps: matching cost computation, matching cost aggregation, disparity map computation, and disparity map optimization, among which the matching cost computation is the core of the whole algorithm. Traditional stereo methods generally compute the matching cost with manually designed image features and cost functions; due to the limitations of manual design, the resulting stereo matching has weak anti-interference capability and limited applicable scenes. In recent years, many convolutional-neural-network-based stereo matching methods have modeled image feature extraction and cost function learning as network layers. For example, DispNetC proposes using a correlation layer to approximate the cost function and then constrains the network to learn image feature extraction through a disparity regression loss; since too much information is lost when the correlation layer computes the matching cost, the accuracy of the binocular matching result is low. GCNet further frees the network to learn image features and the cost function: the left and right image features are concatenated along the channel dimension, and a series of 3D convolutional layers learn the matching cost computation. However, this end-to-end network design is supervised only by the disparity regression loss (regression via the soft argmin function), providing no clear constraint on the matching cost computation process, so the image feature extraction and cost computation functions cannot be learned effectively.
Disclosure of Invention
The invention provides a binocular deep learning method (AcfNet) based on adaptive unimodal stereo matching cost filtering, which improves existing convolutional-neural-network-based stereo matching methods by directly supervising the learning of the matching cost computation process.
In order to solve the above technical problems, the invention adopts the following technical scheme: a binocular deep learning method based on adaptive unimodal stereo matching cost filtering, which directly applies unimodal distribution supervision centered on the true disparity to the matching cost predicted by the network to realize adaptive matching cost filtering, comprising the following steps:
1) constructing a data set comprising left and right images, which serve as a stereo image pair;
2) taking PSMNet as the stereo matching model base network and inputting the stereo image pair into it, the PSMNet base network outputting three matching cost volumes (Cost Volumes) aggregated by a stacked-hourglass 3D convolutional neural network;
3) for each matching cost volume (Cost Volume), estimating a confidence map with a confidence estimation network (Confidence Estimation Network) and using it to adjust the ground-truth cost volume (Ground Truth Cost Volume), generating a pixel-level unimodal distribution (Unimodal Distribution) as the network training label;
4) proposing a stereo focal loss (Stereo Focal Loss) to constrain the estimated matching cost volume against the true matching cost volume;
5) generating a sub-pixel disparity map from the estimated matching cost volume through the soft argmin function, and using a regression L1 loss to supervise the estimated disparity map against the true disparity map; a sketch of how these steps compose is given after this list.
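As a minimal, non-authoritative illustration of how steps 2), 3) and 5) compose, the following PyTorch sketch assumes a `backbone` module standing in for PSMNet that returns three aggregated cost volumes, and a `confidence_net` module for step 3); all names are illustrative, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def acfnet_forward(backbone, confidence_net, left, right, max_disp=192):
    """Sketch of the forward pass: backbone(left, right) is assumed to
    return three aggregated cost volumes of shape (B, D, H, W)."""
    outputs = []
    for cost in backbone(left, right):
        # Step 3: per-pixel confidence map in [0, 1], shape (B, 1, H, W).
        confidence = confidence_net(cost)

        # Step 5: soft argmin -- turn costs into a probability distribution
        # over disparities and regress a sub-pixel disparity map.
        prob = F.softmax(-cost, dim=1)                       # (B, D, H, W)
        disp = torch.arange(max_disp, dtype=prob.dtype,
                            device=prob.device).view(1, -1, 1, 1)
        disparity = torch.sum(prob * disp, dim=1)            # (B, H, W)

        outputs.append((prob, confidence, disparity))
    return outputs
```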
In the technical scheme of the application, under ideal conditions the matching cost distribution of each pixel is a unimodal distribution centered on the true disparity. In order to explicitly constrain the network to learn this cost distribution, and thereby learn more robust image features and cost computation functions, we propose to generate, from the true disparity map, a unimodal cost distribution centered on the true disparity for each pixel, and to apply direct supervision on the matching cost volume (Cost Volume) predicted by the network. In order to reveal the matching uncertainty of each pixel, a confidence estimation network is designed to estimate the confidence of each pixel, which is used to adjust the corresponding true unimodal distribution. The confidence estimation network is the core of adaptive matching cost filtering: it adaptively adjusts the smoothness of the unimodal distribution, i.e. the distribution variance, according to the learning difficulty of the network.
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, true unimodal matching cost distribution generation is performed in step 2): for each pixel p in the reference image, a series of candidate matching pixels is searched along the corresponding epipolar line of the target image; the matching cost volume reflects the similarity of the candidate matching pairs, the matching cost between true matching pairs should be minimal, and the matching cost of other candidate disparity values should increase with their distance from the true disparity; the matching cost distribution of each pixel should therefore be centered on the true disparity.
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, in step 3) a confidence estimation network is constructed to estimate the confidence of the matching cost predicted by the backbone network; the confidence estimation network adaptively adjusts the smoothness of the unimodal distribution, i.e. adaptively adjusts its variance, according to the learning difficulty of the network.
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, in step 4) the stereo focal loss is computed: the matching cost volume constructs D matching costs {C_0, C_1, ..., C_{D-1}} for each pixel, i.e. a matching cost distribution; for a pixel p, the similarity between the estimated matching cost distribution P̂_p(d) and the true matching cost distribution P_p(d) is measured with the cross-entropy loss and used as a network supervision term.
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, in the true unimodal matching cost distribution generation of step 2), the disparity search set in the target image is assumed to be {0, 1, ..., D-1}, where the true disparity value is d_gt, and the true unimodal distribution is defined as:

$$P(d) = \operatorname{softmax}\left(-c_d^{gt}\right) = \frac{\exp\left(-c_d^{gt}\right)}{\sum_{d'=0}^{D-1}\exp\left(-c_{d'}^{gt}\right)}, \qquad c_d^{gt} = \frac{\left|d - d_{gt}\right|}{\sigma}$$

where σ > 0 is the variance, controlling the sharpness of the peak around the true disparity.
Optimally, in the self-confidence evaluation network in the step 3), the self-confidence evaluation network firstly consists of a convolution layer of 3 × 3, a normalization layer and a ReLU layer, and then outputs a value belonging to [0,1 ] in another convolution layer of 1 × 1 and the sigmoid function]The network directly outputs the confidence degree chart f ∈ [0,1 ] for the input aggregated matching cost body]H×WWherein H, W are image height, width, respectively.
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, in the confidence estimation network of step 3), for a pixel p the variance of the true matching cost distribution can be dynamically adjusted by the estimated confidence value f_p: σ_p = s(1 - f_p) + ε, where s ≥ 0 is a constant reflecting the sensitivity of the variance σ to changes in the confidence value f_p, and ε > 0 defines the lower bound of σ; correspondingly, σ_p ∈ [ε, s + ε].
For a pixel p, a large predicted confidence f_p means the network can find its unique matching point with great confidence; conversely, a small predicted confidence value means matching ambiguity exists. Thus the variance of the true matching cost distribution can be dynamically adjusted as σ_p = s(1 - f_p) + ε, where ε > 0 also effectively prevents the mathematical problem of division by zero.
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, in the stereo focal loss computation of step 4), a weighting factor focusing on the positive disparity loss is introduced to improve the cross-entropy loss, finally giving the stereo focal loss the mathematical form:

$$\mathcal{L}_{SF} = \frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}}\sum_{d=0}^{D-1}\left(1 - P_p(d)\right)^{-\alpha}\cdot\left(-P_p(d)\cdot\log \hat{P}_p(d)\right)$$

where α ≥ 0 is the focusing parameter; when α = 0 the loss function degenerates directly to the cross-entropy loss, and when α > 0 the stereo focal loss assigns more weight to positive disparity samples according to P_p(d).
Preferably, in the binocular deep learning method based on adaptive unimodal stereo matching cost filtering, the PSMNet stereo matching model base network includes a Spatial Pyramid Pooling Module for extracting image features, which extracts image features containing multi-scale context information through 4 parallel fixed-size average pooling modules; the PSMNet base network also includes a 3D CNN framework with an hourglass-shaped encoder-decoder structure, which performs repeated top-down and bottom-up processing and supervises the matching cost volumes output by the base network at three stages.
The advantages of the application are as follows. In conventional disparity-regression-based stereo matching methods such as GCNet and PSMNet, an estimated disparity value d̂_p is regressed from the estimated matching cost distribution P̂_p(d) through the soft argmin function:

$$\hat{d}_p = \sum_{d=0}^{D-1} d \cdot \hat{P}_p(d)$$

In the network training stage, for a pixel p with true disparity value d_p, the smooth L1 loss is generally used as the constraint:

$$\mathcal{L}_{regression} = \frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}} \operatorname{smooth}_{L_1}\left(d_p - \hat{d}_p\right)$$

to supervise network learning.
Since the whole process is differentiable, the network can be trained with direct supervision from the true disparity map. However, as the mathematical form of the soft argmin function shows, the matching cost volume serves only as the weighting of a disparity interpolation process; as long as the true disparity value is recovered, the matching cost volume can participate in the interpolation in any state, and no requirement is imposed on its mathematical distribution. This contradicts the fact that the matching cost distribution of each pixel should be unimodal; the direct reason is the lack of direct supervision constraints on the matching cost distribution, which motivates our adaptive unimodal matching cost filtering scheme. The scheme proposed in this application constrains the network to learn and estimate a unimodal matching cost distribution whose cost is minimal, i.e. whose similarity is highest, at the true disparity value. In contrast, in the conventional PSMNet supervised only by the disparity regression loss, no explicit unimodal constraint is applied to the matching cost during network learning, so the estimated matching cost distribution not only exhibits multiple peaks, but the disparity values corresponding to the two lowest-cost (highest-similarity) peaks lie far from the true disparity, showing that the network has not learned a robust feature similarity criterion. By contrast, the proposed scheme AcfNet finds the best-matching pixels in the left and right images and estimates the maximum similarity probability.
The confidence estimation network is the core of adaptive matching cost filtering: it adaptively adjusts the smoothness of the unimodal distribution, i.e. the distribution variance, according to the learning difficulty of the network. To evaluate its performance quantitatively, this application adopts the sparsification plots technique, which reveals the agreement between the predicted confidence evaluation result and the true error magnitude. Sparsification plots of AcfNet are drawn on the Scene Flow test set by progressively removing the pixels with relatively low confidence and evaluating the EPE error of the remaining pixels; the Oracle curve corresponds to the EPE error of the remaining pixels after progressively removing the pixels with relatively large errors; a Random curve gives the EPE error after removing pixels at random. According to these curves, the confidence evaluation curve of the proposed method is very close to the Oracle curve: removing only 6.9% of the pixels halves the error, far better than random removal. This fully demonstrates the superior performance of our confidence estimation in detecting and interpreting outliers.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture framework of the present application;
FIG. 2 is a schematic diagram of the PSMNet network structure adopted in the present application;
FIG. 3 is a graph of ablation test results for various parameters of the present application;
FIG. 4 is a distribution histogram of the variance σ on the Scene Flow test set in the present application;
FIG. 5 is a schematic diagram of the effectiveness of variance adjustment in the present application;
FIG. 6 is an illustration of matching cost distribution samples along the disparity dimension in the present application;
FIG. 7 is a diagram of qualitative evaluation results on the Scene Flow test set according to the present disclosure;
FIG. 8 is a graph of visualization results of the technical solution of the present application on KITTI2012;
FIG. 9 is a graph of visualization results of the technical solution of the present application on KITTI2015;
FIG. 10 is a table of the adaptive unimodal matching cost filtering effectiveness analysis of the present application;
FIG. 11 is a matching cost filtering comparison analysis table of the present application;
FIG. 12 is a table comparing the performance of the technical solution of the present application with prior-art methods on the Scene Flow, KITTI2012 and KITTI2015 data sets.
Detailed Description
The technical features of the present invention will be further explained with reference to the accompanying drawings and specific embodiments.
As shown in the figures, the invention is a binocular deep learning method based on adaptive unimodal stereo matching cost filtering. Ideally, the matching cost distribution of each pixel is a unimodal distribution centered on the true disparity. In order to explicitly constrain the network to learn this cost distribution, and thereby learn more robust image features and cost computation functions, we propose to generate a unimodal cost distribution centered on the true disparity for each pixel from the true disparity map, and to apply direct supervision to the matching cost volume (Cost Volume) predicted by the network. In order to reveal the matching uncertainty of each pixel, a confidence estimation network is designed to estimate the confidence of each pixel, which is used to adjust the corresponding true unimodal distribution. FIG. 1 illustrates the overall network architecture framework of the present application. Since PSMNet is the current state-of-the-art stereo matching model, we adopt it as our base network. For the input left and right image pair, PSMNet outputs 3 matching cost volumes (Cost Volumes) aggregated by a stacked-hourglass 3D convolutional neural network; for each matching cost volume (Cost Volume), a confidence map is estimated by a confidence estimation network (Confidence Estimation Network) and used to adjust the corresponding ground-truth cost volume (Ground Truth Cost Volume), generating a pixel-level unimodal distribution (Unimodal Distribution) as the network training label; and the proposed stereo focal loss (Stereo Focal Loss) constrains the estimated matching cost volume against the true one. Finally, a sub-pixel disparity map is generated from the estimated matching cost volume by the soft argmin function, and a regression L1 loss supervises the estimated disparity map against the true disparity map.
Stereo matching algorithm based on 3D CNN
In the rectified image pair, for each pixel p(x, y) in the left image, the objective of binocular stereo matching is to find the corresponding point p'(x + d, y) in the right image, with d ∈ R+. For ease of computation and memory access, the disparity is generally discretized into a series of candidate disparity values {0, 1, ..., D-1}, so that a matching cost volume (Cost Volume) of size H × W × D can be constructed, where H, W, and D are the image height, width, and maximum disparity value, respectively.
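The following sketch illustrates how such an H × W × D cost volume can be assembled by the GCNet-style concatenation mentioned below, pairing left features with right features shifted by each candidate disparity (a simplification: PSMNet actually builds the volume at reduced feature resolution):

```python
import torch

def build_concat_cost_volume(left_feat, right_feat, max_disp):
    """GCNet-style cost volume: for each candidate disparity d, concatenate
    left features with right features shifted by d pixels.
    left_feat, right_feat: (B, C, H, W); returns (B, 2C, D, H, W)."""
    b, c, h, w = left_feat.shape
    cost = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = left_feat
            cost[:, c:, d] = right_feat
        else:
            # only columns x >= d have a valid counterpart at x - d
            cost[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            cost[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return cost
```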
For the adopted PSMNet model, the network structure is shown in FIG. 2. The PSMNet model mainly comprises 4 parts: feature extraction, matching cost computation, matching cost aggregation based on a 3D Convolutional Neural Network (CNN), and disparity regression.
It is very difficult to determine correspondence from pixel intensities alone, and semantic information at the object level is very beneficial for matching, especially for disparity estimation in ill-conditioned areas. In order to learn the relative relationships between objects, PSMNet proposes a Spatial Pyramid Pooling Module for image feature extraction; through 4 parallel fixed-size average pooling modules, the final image feature representation contains multi-scale context information. The matching cost volume is formed in the GCNet concatenation manner, retaining the most original image information for left-right feature matching. In order to aggregate feature information in both the disparity and spatial dimensions, PSMNet proposes a 3D CNN framework with an hourglass-shaped encoder-decoder structure, which comprises repeated top-down and bottom-up processing and supervises the matching cost volumes output at three stages of the network. Overall, PSMNet achieves very superior stereo matching performance, which is why this network framework is selected as the base network of the present application.
Generally speaking, a matching cost volume (Cost Volume) constructs D matching costs {C_0, C_1, ..., C_{D-1}} for each pixel, i.e. a matching cost distribution. To estimate sub-pixel disparity values from this distribution, GCNet proposes regression with the soft argmin function:

$$\hat{d} = \sum_{d=0}^{D-1} d \cdot \hat{P}(d), \qquad \hat{P}(d) = \operatorname{softmax}(-C_d) \tag{3.1}$$

where the disparity value with the smallest matching cost contributes the most to the final interpolation result. In the network training stage, for a pixel p with true disparity value d_p, the smooth L1 loss is generally used as the constraint:

$$\mathcal{L}_{regression} = \frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}} \operatorname{smooth}_{L_1}\left(d_p - \hat{d}_p\right) \tag{3.2}$$

where smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise.
Since the whole process is differentiable, the network can be trained with direct supervision from the true disparity map. However, as the soft argmin regression process of equation (3.1) shows, the matching cost volume serves only as the weighting of the disparity interpolation process; as long as the true disparity value is recovered, the matching cost volume can participate in the interpolation in any state, and no requirement is imposed on its mathematical distribution. This contradicts the fact that the matching cost distribution of each pixel should be unimodal; the direct reason is the lack of direct supervision constraints on the matching cost distribution, which motivates us to propose an adaptive unimodal matching cost filtering scheme.
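For concreteness, equations (3.1) and (3.2) can be rendered in PyTorch as follows (the valid-pixel mask is an assumption commonly used with sparse ground truth, not stated here):

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume):
    """Equation (3.1): sub-pixel disparity regression by soft argmin.
    cost_volume: (B, D, H, W); a smaller cost means a better match."""
    prob = F.softmax(-cost_volume, dim=1)
    disp = torch.arange(cost_volume.size(1), dtype=prob.dtype,
                        device=prob.device).view(1, -1, 1, 1)
    return torch.sum(prob * disp, dim=1)                     # (B, H, W)

def disparity_regression_loss(pred_disp, gt_disp, max_disp=192):
    """Equation (3.2): smooth L1 loss over pixels with valid ground truth."""
    mask = (gt_disp > 0) & (gt_disp < max_disp)
    return F.smooth_l1_loss(pred_disp[mask], gt_disp[mask])
```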
An adaptive unimodal matching cost filtering module:
As shown in FIG. 1, the proposed AcfNet network structure completes the learning of a unimodal matching cost distribution by embedding only one adaptive unimodal matching cost filtering module on top of PSMNet. For the 3 aggregated matching cost volumes output by PSMNet, the adaptive filtering effect is realized by three parts: unimodal matching cost distribution generation, a confidence estimation network, and the stereo focal loss.
Unimodal matching cost distribution generation
The matching cost volume reflects the similarity of the candidate matching pairs: the matching cost between true matching pairs should be minimal, and the matching cost of other candidate disparity values should increase with their distance from the true disparity. This property requires that the matching cost distribution of each pixel be centered on the true disparity. Given the true disparity d_gt, the unimodal distribution is defined as:

$$P(d) = \operatorname{softmax}\left(-c_d^{gt}\right) = \frac{\exp\left(-c_d^{gt}\right)}{\sum_{d'=0}^{D-1}\exp\left(-c_{d'}^{gt}\right)}, \qquad c_d^{gt} = \frac{\left|d - d_{gt}\right|}{\sigma} \tag{3.4}$$

where σ > 0 is the variance, which controls the sharpness of the peak around the true disparity.
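A sketch of generating this ground-truth unimodal distribution, equation (3.4), for a batch of true disparity maps; the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def unimodal_distribution(gt_disp, sigma, max_disp=192):
    """Equation (3.4): P(d) = softmax(-|d - d_gt| / sigma) along the
    disparity dimension. gt_disp: (B, H, W) true disparities; sigma is a
    scalar or a (B, 1, H, W) per-pixel variance map."""
    d = torch.arange(max_disp, dtype=gt_disp.dtype,
                     device=gt_disp.device).view(1, -1, 1, 1)   # (1, D, 1, 1)
    cost_gt = torch.abs(d - gt_disp.unsqueeze(1)) / sigma       # c_d^gt
    return F.softmax(-cost_gt, dim=1)                           # (B, D, H, W)
```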
Generally, the context information of each pixel differs, so it is unreasonable to keep a uniform true matching cost distribution P(d) for every pixel. For example, a pixel located at the corner of a table may prefer a very sharp single peak, while a weakly-textured region may prefer a relatively flat distribution. In order to establish a reasonable matching cost volume, a confidence estimation network is designed to adaptively adjust the unimodal distribution variance σ_p of each pixel.
Confidence estimation network
In conventional confidence evaluation methods, much research work focuses on analyzing the aggregated matching cost distribution curve in order to effectively detect outliers in the predicted disparity map and improve its prediction accuracy. In the prior art, confidence-guided matching cost filtering methods have been proposed; these generally use the confidence evaluation directly as prior information or as additional feature information to optimize the matching cost and the disparity map. The present method instead uses the confidence score predicted by the network directly to adjust the smoothness of the true matching cost distribution, so that each pixel can adaptively adjust the smoothness of its unimodal distribution according to context information; for the input aggregated matching cost volume, the network directly outputs a confidence map f ∈ [0,1]^{H×W}. For a pixel p, a large predicted confidence f_p means that the network can find its unique matching point with great confidence; conversely, a small predicted confidence value means that matching ambiguity exists. Thus, the variance of the true matching cost distribution can be dynamically adjusted by the estimated confidence value:

$$\sigma_p = s\left(1 - f_p\right) + \epsilon \tag{3.5}$$

where s ≥ 0 is a constant reflecting the sensitivity of the variance σ to changes in the confidence value f_p, and ε > 0 defines the lower bound of σ and effectively prevents the mathematical problem of division by zero; correspondingly, σ_p ∈ [ε, s + ε]. In our experiments, two types of pixels tend to have large variance values σ: weakly-textured pixels and occluded pixels. A weakly-textured region may contain multiple matching candidates, while for occluded pixels the correct matching point cannot be found. Since the σ_p of each pixel can be dynamically adjusted, the true matching cost distribution can be modified accordingly via equation (3.5).
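Equation (3.5) itself is a one-liner; the default values below follow the ablations reported later (lower bound ε = 1.0, sensitivity s = 1):

```python
def adaptive_sigma(confidence, s=1.0, eps=1.0):
    """Equation (3.5): sigma_p = s * (1 - f_p) + eps, so sigma_p lies
    in [eps, s + eps]; confidence is the map f in [0, 1]."""
    return s * (1.0 - confidence) + eps
```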
Stereo focal loss
For a pixel p, given the estimated matching cost distribution P̂_p(d) and the true matching cost distribution P_p(d), computing the distribution error with the cross-entropy loss is the most straightforward way. However, each pixel suffers a severe disparity sample imbalance problem: each pixel has only one true disparity value (positive sample) against hundreds of unmatched disparity values (negative samples). Therefore, inspired by Focal Loss, which addresses sample imbalance in one-stage object detection, the present application proposes a stereo focal loss (Stereo Focal Loss) that focuses on the prediction of positive samples in case network training is dominated by negative samples:

$$\mathcal{L}_{SF} = \frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}}\sum_{d=0}^{D-1}\left(1 - P_p(d)\right)^{-\alpha}\cdot\left(-P_p(d)\cdot\log \hat{P}_p(d)\right)$$

where α ≥ 0 is the focusing parameter; when α = 0 the loss function degenerates directly to the cross-entropy loss, and when α > 0 the stereo focal loss assigns more weight to positive disparity samples according to P_p(d).
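A hedged PyTorch sketch of this stereo focal loss; valid-pixel masking is omitted for brevity, and the clamping constants are numerical-stability assumptions:

```python
import torch

def stereo_focal_loss(est_prob, gt_prob, alpha=5.0, eps=1e-8):
    """Cross-entropy between the true unimodal distribution gt_prob and the
    estimated distribution est_prob, re-weighted by (1 - P_p(d))^(-alpha)
    so positive disparity samples dominate. Both inputs: (B, D, H, W)."""
    weight = (1.0 - gt_prob).clamp(min=eps).pow(-alpha)
    ce = -gt_prob * torch.log(est_prob.clamp(min=eps))
    return (weight * ce).sum(dim=1).mean()   # sum over d, mean over pixels
```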
The overall loss function:
in summary, our final loss function contains a total of three parts:
Figure BDA0002414099650000073
wherein λregression,λconfidenceTwo trade-off parameters.
Figure BDA0002414099650000074
The learning of the matching cost body is supervised,
Figure BDA0002414099650000075
supervising the return of the parallax
Figure BDA0002414099650000076
Then acting as a regularizer encourages more pixels to have a large confidence value,
Figure BDA0002414099650000077
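Combining the pieces, the overall objective might be assembled as below, reusing the stereo_focal_loss and disparity_regression_loss sketches above; λ_confidence = 8.0 follows the ablation reported later, while λ_regression = 1.0 is an assumption:

```python
import torch

def total_loss(est_prob, gt_prob, pred_disp, gt_disp, confidence,
               lambda_regression=1.0, lambda_confidence=8.0, eps=1e-8):
    """L = L_SF + lambda_regression * L_reg + lambda_confidence * L_conf."""
    l_sf = stereo_focal_loss(est_prob, gt_prob)
    l_reg = disparity_regression_loss(pred_disp, gt_disp)
    # Regularizer encouraging more pixels to have a large confidence value.
    l_conf = (-torch.log(confidence.clamp(min=eps))).mean()
    return l_sf + lambda_regression * l_reg + lambda_confidence * l_conf
```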
the above is demonstrated experimentally as follows, the demonstration process being:
database and evaluation index and implementation details
(1) Database
For qualitative and quantitative evaluation of the methods proposed in this application, evaluations are performed on three public datasets (Scene Flow, KITTI2012, KITTI2015). Scene Flow is a synthetic dataset comprising 35454 training picture pairs and 4370 testing picture pairs; it provides dense true disparity annotations, making it very suitable for training and testing network models. KITTI2012 and KITTI2015 are two real street-view datasets whose disparity annotations are obtained by radar scanning and are therefore sparse. The former comprises 194 training picture pairs and 195 test picture pairs; the latter contains 200 training picture pairs and 200 test picture pairs. Both KITTI datasets are too small for training neural networks, so, following GC-Net, ablation experiments are designed mainly on Scene Flow to analyze the network design.
(2) Evaluation index
In the experiments, we used two standard evaluation indexes: 3-pixel-error (3PE), the percentage of pixels whose predicted disparity differs from the true disparity by more than 3 pixels; and end-point-error (EPE), the average difference between the predicted disparity and the true disparity. EPE emphasizes sub-pixel errors, while 3PE emphasizes the percentage of outliers. Moreover, in order to further evaluate the performance of the proposed method on occluded regions, the Scene Flow test set is divided into occluded (OCC) and non-occluded (NOC) regions according to a left-right consistency check. Denoting by p the coordinate of a pixel in the left true disparity map D_L, the criterion is as follows:
$$\text{NOC:}\quad \left|d - D_R(p - d)\right| \le 1 \quad \text{for } d = D_L(p) \tag{3.10}$$
$$\text{OCC:}\quad \text{otherwise} \tag{3.11}$$
where D_R is the right true disparity map and p - d is the corresponding position p in the right image shifted left by d pixels. According to our statistics, occluded pixels account for 16% of the entire test set.
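A sketch of this left-right consistency check, equations (3.10)-(3.11), using nearest-pixel lookup instead of interpolation (a simplification):

```python
import numpy as np

def occlusion_mask(disp_left, disp_right):
    """Pixel p is non-occluded (NOC) if |d - D_R(p - d)| <= 1 for
    d = D_L(p). disp_left, disp_right: (H, W) float disparity maps.
    Returns a boolean mask that is True where the pixel is occluded."""
    h, w = disp_left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_right = np.rint(xs - disp_left).astype(int)   # position p - d
    valid = (x_right >= 0) & (x_right < w)
    d_right = np.zeros_like(disp_left)
    d_right[valid] = disp_right[ys[valid], x_right[valid]]
    noc = valid & (np.abs(disp_left - d_right) <= 1.0)
    return ~noc
```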
(3) Implementation details
The method of this application is implemented in PyTorch, and all models are trained end-to-end with the standard RMSProp setting. Color normalization is adopted for data preprocessing on all images in the dataset. During training, image blocks of height H = 256 and width W = 512 are randomly cropped, and the maximum disparity value D is set to 192. For network training, network parameters are randomly initialized and trained on Scene Flow for 10 epochs at a constant learning rate of 0.001, and the trained model is tested directly. For the KITTI datasets, fine-tuning is performed for 600 epochs using the model pre-trained on Scene Flow; the initial fine-tuning learning rate is set to 0.001 and is decayed at the 100th and 300th epochs.
When submitting to the KITTI public leaderboard, training on Scene Flow is extended to 20 epochs to obtain a better pre-trained model. The training batch size is 3, with 3 NVIDIA GTX 1080Ti GPUs in total, so that one part of each batch is placed on each graphics card.
Experimental results and discussion
(1) Analyzing ablation experiment results:
all experiments were performed on a Scene Flow dataset, since there was sufficient data volume for network end-to-end training and no worry about overfitting issues. In all experiments, the Stereo Focal local performs positive-negative parallax sample equalization with α being 5.0. Considering that most parallax prediction errors are sub-pixel errors, namely the errors are smaller than 1 pixel point, and the 3-pixel error evaluated by the 3PE cannot accurately reveal the network performance, the method only utilizes the EPE error to research the performance difference of the network under different parameter settings.
Unimodal distribution variance σ analysis:
the size of the variance sigma reflects the sharpness of unimodal distribution, and plays a crucial role in the binocular deep learning method based on the adaptive unimodal stereo matching cost filtering mentioned in the application. In the method of the present application, the variance is mainly limited by the sum S, i.e., σp∈[,s+]。
First, a case where the variance σ is fixed, that is, the variances of all the pixels are the same value (s ═ 0, σ ═ is studied. Through grid search, the network prediction result is best when σ is 1.2. This also suggests that for most pixel points, they prefer to establish a unimodal distribution with σ ═ 1.2. Therefore, the lower limit of σ is set to 1.0 to explore adaptive variance learning.
Next, a variance sensitivity adjustment parameter S is investigated, which controls the upper limit of the variance σ. Fig. 3(a) shows the results obtained by adjusting the parameter S, where S is 1, the best effect is obtained, and the performance is quite stable when S is changed from 0.5 to 3.0. And, after the network converges, the variance distribution histogram of all the pixels in the Scene Flow test set when s is 1, i.e. σ ∈ [1.0,2.0] is shown in fig. 4. It can be seen that most of the pixels are biased to small variance, and some of the pixels require larger variance to smooth the single peak distribution.
Loss balancing weights:
λ_confidence adjusts the balance between the confidence network loss and the other losses, and also implicitly controls the learning of the variance. As FIG. 3(b) shows, a λ_confidence that is too large or too small makes the matching confidence of each pixel too high or too low and leads to worse results; the best performance is obtained at λ_confidence = 8.0.
λ_regression balances the disparity regression loss widely used in existing networks; an excessive λ_regression suppresses the other two losses presented here. FIG. 3(c) shows the performance variation curve: properly balancing the regression loss against the other two losses greatly improves the network matching performance.
(2) Analysis of variance
The variance estimation is an important design for realizing adaptive matching cost filtering in this application: it adaptively adjusts the smoothness of the unimodal distribution according to the difficulty of network learning. In order to quantitatively evaluate its performance, the sparsification plots technique is adopted, which reveals the agreement between our predicted confidence evaluation result and the true error magnitude. As shown in FIG. 5, sparsification plots of AcfNet were drawn on the Scene Flow test set. The graph shows the EPE error of the remaining pixels after progressively removing the pixels with relatively low confidence; the Oracle curve corresponds to the EPE error of the remaining pixels after progressively removing the pixels with relatively large errors; a Random curve gives the EPE error after randomly removing pixels. The results show that the confidence evaluation curve of the proposed method is very close to the Oracle curve: removing only 6.9% of the pixels halves the error, far better than random removal. This fully demonstrates the superior performance of the confidence evaluation of the present application in detecting and interpreting outliers. Furthermore, several visualization examples are given with reference to FIG. 7. The difficult-to-learn regions are mainly the occlusion regions (1a, 1c, 2a), the full-mode regions (1b, 3a) and fine objects (3a). In these difficult areas, the network of the present application gives very low confidence, which also proves that the network can flatten the distributions of these areas to reduce their influence, thereby effectively preventing overfitting in these areas.
(3) Network module validity analysis
The effectiveness of the technical scheme is verified by adding its components one by one to the PSMNet-based network; the experimental results are shown in FIG. 10. Relative to the base network PSMNet, the effectiveness of applying the unimodal distribution constraint to the matching cost volume is verified first, with distribution learning constrained by the cross-entropy loss (CE): the unimodal constraint brings a significant performance improvement on every index, demonstrating the superiority of unimodal matching cost filtering. Then, the positive-negative disparity sample problem of the cross-entropy loss is addressed with the stereo focal loss (SF), which brings further improvement on every index. Finally, the confidence estimation network (CENet) is added, which again greatly improves the accuracy on every index. It is worth mentioning that on the All-EPE evaluation index, the technical scheme of the application reduces the error of PSMNet from 1.101 to 0.867, an improvement of nearly 20%, fully reflecting the superiority and high performance of adaptive unimodal matching cost filtering. Meanwhile, several visualization results are given in FIG. 7, which show, from left to right: the left image, the right image, the true disparity map, the predicted disparity map, the error map, and the confidence map; in the error map, warm tones mean larger errors, and in the confidence map, darker colors indicate lower confidence. The prediction results of the technical scheme are basically consistent with the true disparity map, and remain good even in regions with complex structure.
(4) Adaptive unimodal matching cost filtering effectiveness analysis
AcfNet adds the unimodal matching cost filtering constraint directly on top of PSMNet. FIG. 10 shows the performance comparison between the two versions of AcfNet and PSMNet: AcfNet(uniform), with a uniform variance for every pixel, shows a large performance improvement over PSMNet, and the adaptive version AcfNet(adaptive) further improves the matching accuracy. This fully demonstrates the effectiveness of unimodal supervision and the superiority of adaptive variance adjustment. Compared with AcfNet(uniform), AcfNet(adaptive) improves more markedly in OCC (occlusion) areas, which is consistent with the conclusion from the variance analysis that the confidence estimation network can effectively detect these areas and prevent the network from overfitting in them.
(5) Matching cost filtering contrast analysis
Although there are many classical matching-cost-based filtering methods, they have not been comparable to existing deep-learning-based methods. One existing matching cost enhancement strategy is to generate a Gaussian distribution centered on the true disparity and then use it as a weight on the unaggregated matching cost, so as to enhance a unimodal matching cost distribution centered on the true disparity. This existing method differs from the present method in two main points: 1) the unimodal distribution of the existing method influences the matching cost distribution as a weight, whereas the present method uses it directly as a network supervision term, which can directly guide the network to filter the matching cost into a single peak; 2) the existing method needs the true disparity information in both the training and testing phases, whereas the present method needs it only during training. As shown in FIG. 11, all the methods in the table are trained from random initialization on the Scene Flow dataset and then directly tested for generalization on KITTI2012 and KITTI2015; all methods are PSMNet-based networks and use all available disparity information. As a comparison of filtering performance, the present method far surpasses the results of the existing matching cost enhancement strategy. Moreover, from the perspective of generalization, the technical scheme of the application improves over PSMNet by 11.64% on KITTI2012 and 10.74% on KITTI2015. This means that the explicit unimodal constraint enables the network to learn a better similarity measure and feature extraction mode, thereby showing superior generalization performance on different data sets.
(6) State-of-the-art method contrastive analysis matched with binocular stereo
To further evaluate the performance of the technical scheme of the application, FIG. 12 provides a comparison on the three data sets Scene Flow, KITTI2012 and KITTI2015 against methods embodying the current state of the art, including: classification-based methods (MC-CNN, PDS, HD3-Stereo), methods enhancing the matching cost computation (GwcNet), methods stacking optimization sub-networks (iResNet-i2), methods with very powerful cost aggregation networks (PSMNet, GA-Net), and methods adding extra information (EdgeStereo, SegStereo). Although all of them try to improve the network to obtain more robust stereo matching results, the method of the application still outperforms the prior art in terms of performance. FIG. 8 and FIG. 9 visualize several examples of the present application on KITTI2012 and KITTI2015 and mark where the comparison with PDS and PSMNet is significantly better. Two visual samples are provided for each data set; in each sample, the first row is the disparity map prediction result and the second row is the error map visualization, where in the KITTI2012 visualization white represents inaccurate prediction, and in KITTI2015 warm tones represent inaccurate prediction. It can be seen that the method of the application performs better on small objects and at picture and sky edges.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (9)

1. A binocular deep learning method based on adaptive unimodal stereo matching cost filtering, characterized in that unimodal distribution supervision centered on the true disparity is directly applied to the matching cost predicted by the network to realize adaptive matching cost filtering, the method comprising the following steps:
1) constructing a data set comprising left and right images, which serve as a stereo image pair;
2) taking PSMNet as the stereo matching model base network and inputting the stereo image pair into it, the PSMNet base network outputting three matching cost volumes (Cost Volumes) aggregated by a stacked-hourglass 3D convolutional neural network;
3) for each matching cost volume (Cost Volume), estimating a confidence map with a confidence estimation network (Confidence Estimation Network) and using it to adjust the ground-truth cost volume (Ground Truth Cost Volume), generating a pixel-level unimodal distribution (Unimodal Distribution) as the network training label;
4) proposing a stereo focal loss (Stereo Focal Loss) to constrain the estimated matching cost volume against the true matching cost volume;
5) generating a sub-pixel disparity map from the estimated matching cost volume through the soft argmin function, and using a regression L1 loss to supervise the estimated disparity map against the true disparity map.
2. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 1, characterized in that: true unimodal matching cost distribution generation is performed in step 2), i.e. for each pixel p in the reference image, a series of candidate matching pixels is searched along the corresponding epipolar line of the target image; the matching cost volume reflects the similarity of the candidate matching pairs, the matching cost between true matching pairs should be minimal, and the matching cost of other candidate disparity values should increase with their distance from the true disparity; the matching cost distribution of each pixel should be centered on the true disparity.
3. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 1, characterized in that: in step 3), a confidence estimation network is constructed to estimate the confidence of the matching cost predicted by the backbone network, and the confidence estimation network adaptively adjusts the smoothness of the unimodal distribution, i.e. adaptively adjusts its variance, according to the learning difficulty of the network.
4. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 1, characterized in that: in step 4), the stereo focal loss is computed; the matching cost volume constructs D matching costs {C_0, C_1, ..., C_{D-1}} for each pixel, i.e. a matching cost distribution; for a pixel p, the similarity between the estimated matching cost distribution P̂_p(d) and the true matching cost distribution P_p(d) is measured with the cross-entropy loss and used as a network supervision term.
5. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 2, characterized in that: in the true unimodal matching cost distribution generation of step 2), the disparity search set in the target image is assumed to be {0, 1, ..., D-1}, where the true disparity value is d_gt, and the true unimodal distribution is defined as:

$$P(d) = \operatorname{softmax}\left(-c_d^{gt}\right) = \frac{\exp\left(-c_d^{gt}\right)}{\sum_{d'=0}^{D-1}\exp\left(-c_{d'}^{gt}\right)}, \qquad c_d^{gt} = \frac{\left|d - d_{gt}\right|}{\sigma}$$

where σ > 0 is the variance, controlling the sharpness of the peak around the true disparity.
6. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 3, characterized in that: in the confidence estimation network of step 3), the network consists of a 3×3 convolutional layer, a normalization layer and a ReLU layer, followed by another 1×1 convolutional layer and a sigmoid function that outputs values in [0,1]; for the input aggregated matching cost volume, the network directly outputs a confidence map f ∈ [0,1]^{H×W}, where H and W are the image height and width, respectively.
7. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 6, characterized in that: in the confidence estimation network of step 3), for a pixel p, the variance of the true matching cost distribution can be dynamically adjusted by the estimated confidence value f_p: σ_p = s(1 - f_p) + ε, where s ≥ 0 is a constant reflecting the sensitivity of the variance σ to changes in the confidence value f_p, and ε > 0 defines the lower bound of σ; correspondingly, σ_p ∈ [ε, s + ε].
8. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 1, characterized in that: in the stereo focal loss computation of step 4), a weighting factor focusing on the positive disparity loss is introduced to improve the cross-entropy loss, finally giving the stereo focal loss the mathematical form:

$$\mathcal{L}_{SF} = \frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}}\sum_{d=0}^{D-1}\left(1 - P_p(d)\right)^{-\alpha}\cdot\left(-P_p(d)\cdot\log \hat{P}_p(d)\right)$$

where α ≥ 0 is the focusing parameter; when α = 0 the loss function degenerates directly to the cross-entropy loss, and when α > 0 the stereo focal loss assigns more weight to positive disparity samples according to P_p(d).
9. The binocular deep learning method based on adaptive unimodal stereo matching cost filtering according to claim 1, characterized in that: the PSMNet stereo matching model base network includes a Spatial Pyramid Pooling Module for extracting image features, which extracts image features containing multi-scale context information through 4 parallel fixed-size average pooling modules; the PSMNet base network also includes a 3D CNN framework with an hourglass-shaped encoder-decoder structure, which performs repeated top-down and bottom-up processing and supervises the matching cost volumes output by the base network at three stages.
CN202010185728.1A 2020-03-17 2020-03-17 Binocular deep learning method based on adaptive unimodal stereo matching cost filtering Pending CN111709977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010185728.1A CN111709977A (en) 2020-03-17 2020-03-17 Binocular depth learning method based on adaptive unimodal stereo matching cost filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010185728.1A CN111709977A (en) 2020-03-17 2020-03-17 Binocular depth learning method based on adaptive unimodal stereo matching cost filtering

Publications (1)

Publication Number Publication Date
CN111709977A true CN111709977A (en) 2020-09-25

Family

ID=72536506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010185728.1A Pending CN111709977A (en) 2020-03-17 2020-03-17 Binocular depth learning method based on adaptive unimodal stereo matching cost filtering

Country Status (1)

Country Link
CN (1) CN111709977A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362462A (en) * 2021-02-01 2021-09-07 中国计量大学 Binocular stereo vision parallax filtering method and device based on self-supervision learning
CN114782507A (en) * 2022-06-20 2022-07-22 中国科学技术大学 Asymmetric binocular stereo matching method and system based on unsupervised learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248769A1 (en) * 2014-03-03 2015-09-03 Nokia Corporation Method, apparatus and computer program product for disparity map estimation of stereo images
CN109584290A (en) * 2018-12-03 2019-04-05 北京航空航天大学 A kind of three-dimensional image matching method based on convolutional neural networks
CN109887019A (en) * 2019-02-19 2019-06-14 北京市商汤科技开发有限公司 A kind of binocular ranging method and device, equipment and storage medium
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248769A1 (en) * 2014-03-03 2015-09-03 Nokia Corporation Method, apparatus and computer program product for disparity map estimation of stereo images
CN109584290A (en) * 2018-12-03 2019-04-05 北京航空航天大学 A kind of three-dimensional image matching method based on convolutional neural networks
CN109887019A (en) * 2019-02-19 2019-06-14 北京市商汤科技开发有限公司 A kind of binocular ranging method and device, equipment and storage medium
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIA-REN CHANG: "Pyramid Stereo Matching Network", 《ARXIV》 *
YOUMIN ZHANG: "Adaptive Unimodal Cost Volume Filtering for Deep Stereo Matching", 《ARXIV》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362462A (en) * 2021-02-01 2021-09-07 中国计量大学 Binocular stereo vision parallax filtering method and device based on self-supervision learning
CN113362462B (en) * 2021-02-01 2024-04-05 中国计量大学 Binocular stereoscopic vision parallax filtering method and device based on self-supervision learning
CN114782507A (en) * 2022-06-20 2022-07-22 中国科学技术大学 Asymmetric binocular stereo matching method and system based on unsupervised learning
CN114782507B (en) * 2022-06-20 2022-09-30 中国科学技术大学 Asymmetric binocular stereo matching method and system based on unsupervised learning

Similar Documents

Publication Publication Date Title
CN109815893B (en) Color face image illumination domain normalization method based on cyclic generation countermeasure network
CN104867135B (en) A kind of High Precision Stereo matching process guided based on guide image
CN112884682B (en) Stereo image color correction method and system based on matching and fusion
CN106462771A (en) 3D image significance detection method
CN110414349A (en) Introduce the twin convolutional neural networks face recognition algorithms of sensor model
CN103996202A (en) Stereo matching method based on hybrid matching cost and adaptive window
CN103996201A (en) Stereo matching method based on improved gradient and adaptive window
CN109831664B (en) Rapid compressed stereo video quality evaluation method based on deep learning
CN112784782B (en) Three-dimensional object identification method based on multi-view double-attention network
CN111402311A (en) Knowledge distillation-based lightweight stereo parallax estimation method
CN107146248A (en) A kind of solid matching method based on double-current convolutional neural networks
Messai et al. Adaboost neural network and cyclopean view for no-reference stereoscopic image quality assessment
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN106355195A (en) The system and method used to measure image resolution value
CN108388901B (en) Collaborative significant target detection method based on space-semantic channel
CN114092697A (en) Building facade semantic segmentation method with attention fused with global and local depth features
CN111709977A (en) Binocular depth learning method based on adaptive unimodal stereo matching cost filtering
CN111310821A (en) Multi-view feature fusion method, system, computer device and storage medium
CN115496720A (en) Gastrointestinal cancer pathological image segmentation method based on ViT mechanism model and related equipment
CN111553296B (en) Two-value neural network stereo vision matching method based on FPGA
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
CN110659680B (en) Image patch matching method based on multi-scale convolution
CN110135435B (en) Saliency detection method and device based on breadth learning system
CN107341449A (en) A kind of GMS Calculation of precipitation method based on cloud mass changing features
CN113011359B (en) Method for simultaneously detecting plane structure and generating plane description based on image and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200925