CN115375746A - Stereo matching method based on double-space pooling pyramid - Google Patents

Stereo matching method based on double-space pooling pyramid

Info

Publication number
CN115375746A
CN115375746A (application CN202210336322.8A)
Authority
CN
China
Prior art keywords
network
pooling
image
stereo matching
max
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210336322.8A
Other languages
Chinese (zh)
Inventor
何立火
刘晓天
唐杰浩
柯俊杰
高新波
路文
李洁
王笛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210336322.8A priority Critical patent/CN115375746A/en
Publication of CN115375746A publication Critical patent/CN115375746A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/40Analysis of texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20228Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stereo matching method based on a double-space pooling pyramid, which mainly solves the problems that multi-scale semantic information is lost and that texture information and global information of the features cannot be fully extracted in existing binocular stereo matching methods. The implementation scheme is as follows: acquiring a binocular stereo matching training sample set T and a verification sample set V; respectively establishing a feature extraction network E, a cost aggregation network R and a parallax regression network G, and cascading them to form a binocular stereo matching network S; training the binocular stereo matching network S with the training sample set T by stochastic gradient descent; and inputting the verification sample set V into the trained binocular stereo matching network S_t to obtain the stereo matching result I_i^pred. The method improves the matching precision of stereo matching at semantic edges and in texture-complex areas, effectively reduces outliers in the matching result, improves the accuracy of the stereo matching result, and can be used for disparity estimation on images and videos acquired by a binocular camera.

Description

Stereo matching method based on double-space pooling pyramid
Technical Field
The invention belongs to the technical field of computer vision and further relates to a stereo matching method that can be used for disparity estimation on images and videos acquired by a binocular camera.
Background
Binocular vision is an important direction in the field of computer vision, and the stereo matching task is one of its most important subtasks. Stereo matching refers to applying epipolar rectification to the two-dimensional binocular images acquired by a binocular camera and then obtaining the disparity values and depth information of the three-dimensional scene by matching feature points between the rectified images. The basic principle is to search for corresponding points of the binocular images in the left and right eye images and take the absolute value of the difference between the horizontal coordinates of the corresponding points in the two images as the disparity value. The three-dimensional depth information can then be obtained from the disparity value, the camera focal length and the baseline distance of the binocular camera. Stereo matching is built on the spatial geometry of the scene and has the advantages of high solving speed and low hardware cost.
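As an illustration of the relation between disparity, focal length, baseline and depth described above, a minimal Python sketch follows; the numeric values in the example are rough, assumed values and are not taken from the embodiment.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth from disparity for a rectified binocular pair:
    depth = focal_length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Example with assumed, KITTI-like values: f ~ 721 px, baseline ~ 0.54 m, disparity = 30 px
# depth = 721 * 0.54 / 30 ~ 12.98 m
```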
Traditional stereo matching methods are divided into global and semi-global methods. The global method constructs a global energy function based on a smoothness assumption and solves for a minimum of the energy function with an optimization method; its drawbacks are a large amount of computation, long running time and poor real-time performance. The semi-global method uses local information: it computes the total matching cost within a matching window of a specific size and searches for the minimum with a winner-takes-all (WTA) strategy to obtain the final disparity value. This method is strongly affected by the window size: a window that is too small may fail to contain all the texture information of an object and thus produce ambiguity, while a window that is too large produces a dilation effect in depth-discontinuous areas and increases the computation, as illustrated by the sketch below.
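For illustration only, the following NumPy sketch implements a simplified local (window-based) variant of the SAD cost computation and winner-takes-all selection described above; it is not the semi-global algorithm itself, and the window size and disparity search range are assumed values.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wta_block_matching(left, right, max_disp=64, window=9):
    """Minimal local stereo matching sketch: per-disparity absolute differences are
    aggregated over a fixed window (SAD) and the disparity with minimum cost is
    chosen by winner-takes-all (WTA)."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            diff = np.abs(left - right)
        else:
            diff = np.abs(left[:, d:] - right[:, :-d])
        # box-filter the pixel-wise cost over the matching window
        cost[d, :, d:] = uniform_filter(diff, size=window)
    return np.argmin(cost, axis=0)  # WTA disparity map (in pixels)
```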
With the development of computer technology, deep learning in the field of artificial intelligence has advanced rapidly. Compared with traditional stereo matching methods, stereo matching methods based on deep learning have stronger feature expression and feature learning capability and the advantage of parallelization. A deep-learning stereo matching method builds the stereo matching network from convolutional neural networks, trains it continuously with back-propagation-based gradient updates, and can then complete stereo matching in a single end-to-end pass.
Beijing Institute of Technology proposes 'A neighbor propagation stereo matching method based on image pyramid distance measurement' in the patent documents with application number 202110550638.2 and publication number CN113177565A, with the following implementation steps: 1) extracting features of the stereo image pair consisting of the left and right images and pairing them using a distance metric; 2) repeatedly downsampling the images by a factor of two and matching the downsampled images; 3) selecting correctly matched pixel-point pairs and propagating the matching result to neighbouring pixels. This method searches for matching points using constraints constructed from feature matching, but because the image pyramid is constructed and features are obtained only from the input left and right target images, deeper semantic and texture information of the images cannot be fully extracted, and the precision of the resulting matching is poor.
Shandong University proposes 'A stereo matching feature extraction and disparity map post-processing method' in the patent documents with application number 202110281313.9 and publication number CN112991420A, with the following implementation steps: 1) extracting stereo matching features and pyramid features with convolution kernels; 2) aggregating the features output by the convolutional network with a 4-path cost aggregation method; 3) performing internal left-right consistency detection on the disparity map and filling invalid disparity points to obtain the final disparity map. This method has the following defects: the convolutional neural network is used only in the feature extraction part and no end-to-end network is constructed for learning and training, which hinders further learning of disparity and increases the training difficulty of the model; meanwhile, the cost aggregation algorithm used is complex and not easy to parallelize or accelerate, so the finally output disparity map has low precision and the method is time-consuming.
The patent documents of Beijing Institute of Technology with application number 202110550638.2 and publication number CN113177565A disclose 'A binocular vision position measuring system and method based on deep learning' for binocular vision position measurement. The system comprises an image capturing module, a deep-learning object recognition module, an image segmentation module, a fitting module and a binocular point cloud module. The implementation steps are: 1) building a disparity calculation submodule with a convolutional neural network to obtain a disparity map referenced to the left camera of the binocular pair; 2) using a point cloud computation submodule to obtain the three-dimensional point cloud of the scene from the disparity map and the intrinsic and extrinsic parameters of the binocular camera. Because a semi-global matching method is used during disparity aggregation, the obtained disparity map tends to contain many outliers, which leads to more noise in the final output.
Jia-Ren Chang et al., in the paper "Pyramid Stereo Matching Network" (Conference on Computer Vision and Pattern Recognition, 2018), propose an end-to-end binocular stereo matching network. The network first performs feature extraction on the binocular images with a convolutional neural network containing a spatial pooling pyramid, then constructs a four-dimensional matching cost volume, and learns and aggregates the cost volume with stacked three-dimensional convolutional neural networks to finally obtain the disparity map. Because the feature extraction network uses only a spatial average pooling pyramid to extract averaged features over multiple scales, edge and texture information in the original image and feature maps cannot be preserved, which easily causes matching failures between the left and right eye images, so the matching effect at semantic edges and in complex texture areas is poor.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a stereo matching method based on a double-space pooling pyramid, so that the feature information of the image is fully extracted, the matching effect at semantic edges and in texture-complex areas is improved, outliers in the matching result are reduced, the accuracy of the final stereo matching is improved, and a more accurate disparity map is obtained.
The technical key point of the invention is to improve the precision of the final output disparity map by using a double pooling pyramid structure in the feature extraction module of the end-to-end binocular stereo matching network. The implementation scheme comprises the following steps:
(1) Obtaining a binocular stereo matching training sample set T and a verification sample set V
Acquiring N+M groups of image pairs I_i = {I_i^L, I_i^R, I_i^GT} from an existing binocular stereo matching data set, where I_i^L and I_i^R are the left and right eye images of the i-th image pair and I_i^GT is the true disparity image of the i-th image pair;
Taking N image pairs in the data set as the training sample set T = {I_1, I_2, …, I_j, …, I_N}, 1 ≤ j ≤ N, and M image pairs as the verification set V = {I_(N+1), I_(N+2), …, I_(N+k), …, I_(N+M)}, 1 ≤ k ≤ M, with N ≥ 180 and M ≥ 20;
(2) Constructing a binocular stereo matching network S based on the dual-space pooling pyramid:
(2a) Establishing a double pooling pyramid network E2 formed by connecting an average pooling pyramid E2_AVG and a maximum pooling pyramid E2_MAX in parallel;
(2b) Establishing a feature extraction network E formed by cascading an initial convolutional neural network E1, a double pooling pyramid network E2 and a feature fusion network E3;
(2c) Establishing a cost aggregation network R formed by cascading three stacked three-dimensional convolutional neural networks;
(2d) Cascading the feature extraction network E, the cost aggregation network R and the parallax regression network G to form a binocular stereo matching network S;
(3) Training a binocular stereo matching network S:
(3a) Setting the learning rate η, the maximum number of training iterations t_max, and the current iteration period t′ = 0;
(3b) For the left eye image I_i^L and the right eye image I_i^R in the training set T, obtaining the predicted disparity map I_i^pred output by the network using forward propagation;
(3c) Computing the error L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT using the SmoothL1 loss function, and updating the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t with the back-propagation gradient descent algorithm;
(3d) Adding 1 to the current iteration period, i.e. t′ = t′ + 1, and judging whether the maximum number of iterations has been reached:
if t′ = t_max, the maximum number of iterations has been reached, training ends, and the trained binocular stereo matching network S_t is obtained;
if t′ < t_max, returning to (3b);
(4) Inputting the verification sample set V into the trained stereo matching network S_t of the double spatial pooling pyramid and passing it sequentially through the feature extraction network E, the cost aggregation network R and the parallax regression network G to obtain the predicted disparity map I_i^pred, which is the result of stereo matching.
Compared with the prior art, the invention has the following advantages:
1. The invention uses the double pooling pyramid network E2, which provides the binocular stereo matching network with multi-scale features rich in texture information and global semantic information, improving the precision of the stereo matching result at semantic edges, in texture-complex areas and in texture-missing areas.
2. The invention uses the feature extraction network E to extract and further fuse the features of the left and right eye images, improving the accuracy of the final stereo matching result and effectively reducing erroneous outliers in the matching result.
3. The invention uses the cost aggregation network R and the parallax regression network G to match the binocular images and output the predicted disparity map.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a pooling pyramid structure in the present invention;
FIG. 3 is an output disparity map obtained by stereo matching left and right eye images in the simulation of the present invention.
Detailed Description
Embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, acquiring a binocular stereo matching training sample set T and a verification sample set V.
The KITTI data set is obtained. This data set was established by Geiger et al. at the Karlsruhe Institute of Technology and was captured in three kinds of scenes (urban, rural and highway) using a car equipped with two high-definition color cameras, a Velodyne lidar and a GPS. The data set provides binocular images, disparity maps, optical flow maps and other information, and can be used for tasks such as stereo matching, optical flow prediction, 3D object detection and 3D tracking. It is divided into two sub-data sets, KITTI2012 and KITTI2015. The KITTI2012 sub-data set provides 194 training image groups containing binocular images and true disparity values, and 195 verification image groups containing only binocular images, whose true disparity values are not published; KITTI2015 extends this with 200 additional training image groups and 200 additional verification image groups.
N+M groups of image pairs I_i = {I_i^L, I_i^R, I_i^GT} with resolution W×H are obtained from the KITTI binocular stereo matching data set, where I_i^L and I_i^R are the left and right eye images of the i-th image pair and I_i^GT is the true disparity image of the i-th image pair; in this embodiment W = 512 and H = 256.
N = 394 image pairs in the data set are taken as the training sample set T = {I_1, I_2, …, I_j, …, I_N}, 1 ≤ j ≤ N;
M = 200 image pairs are taken as the verification set V = {I_(N+1), I_(N+2), …, I_(N+k), …, I_(N+M)}, 1 ≤ k ≤ M, N ≥ 180, M ≥ 20.
A total of 394 training image groups in KITTI2012 and KITTI2015 are taken as the training image set T,
the 200 verification images in KITTI2012 are taken as the verification set V_1,
and the 200 verification images in KITTI2015 are taken as the verification set V_2.
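For illustration only, the sample sets could be wrapped as PyTorch datasets as follows; the directory names and file organization are assumptions of this sketch, not part of the embodiment.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class KittiStereoSet(Dataset):
    """Sketch of a binocular stereo matching sample set: each item is an image pair
    I_i = (left, right, ground-truth disparity). Directory names are assumptions."""
    def __init__(self, root, file_list, transform=None):
        self.root = root
        self.names = file_list          # frame identifiers belonging to this split
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        left = Image.open(os.path.join(self.root, 'image_2', self.names[i]))
        right = Image.open(os.path.join(self.root, 'image_3', self.names[i]))
        disp_gt = Image.open(os.path.join(self.root, 'disp_occ_0', self.names[i]))
        sample = (left, right, disp_gt)
        return self.transform(sample) if self.transform else sample

# Training set T: the 394 KITTI2012 + KITTI2015 training pairs;
# verification sets V_1 / V_2: the 200 KITTI2012 and 200 KITTI2015 verification pairs.
```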
Step 2, constructing a stereo matching network S based on the double spatial pooling pyramid.
2.1) With reference to FIG. 2, a double pooling pyramid network E2 is built by connecting an average pooling pyramid E2_AVG and a maximum pooling pyramid E2_MAX in parallel, wherein:
the average pooling pyramid E2_AVG has four average pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size;
the maximum pooling pyramid E2_MAX has four maximum pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size;
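For illustration only, a minimal PyTorch-style sketch of the double pooling pyramid E2 described in 2.1) is given below; the 1×1 convolutions after each pooling branch and the bilinear upsampling back to the input resolution are assumptions of this sketch, not limitations of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPoolingPyramid(nn.Module):
    """Sketch of the double pooling pyramid E2: parallel average- and max-pooling
    branches at scales 4, 8, 16, 64, each reduced to 32 channels (assumption)."""
    def __init__(self, in_channels=32, branch_channels=32, scales=(4, 8, 16, 64)):
        super().__init__()
        self.scales = scales
        self.avg_convs = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False)
            for _ in scales)
        self.max_convs = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False)
            for _ in scales)

    def forward(self, x):
        h, w = x.shape[2:]
        outputs = []
        for s, conv in zip(self.scales, self.avg_convs):
            y = F.avg_pool2d(x, kernel_size=s, stride=s)      # E2_AVG branch
            outputs.append(F.interpolate(conv(y), size=(h, w),
                                         mode='bilinear', align_corners=False))
        for s, conv in zip(self.scales, self.max_convs):
            y = F.max_pool2d(x, kernel_size=s, stride=s)      # E2_MAX branch
            outputs.append(F.interpolate(conv(y), size=(h, w),
                                         mode='bilinear', align_corners=False))
        # stack all pooled feature maps with the input for later fusion by E3
        return torch.cat([x] + outputs, dim=1)
```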
2.2) Establishing a feature extraction network E formed by cascading the initial convolutional neural network E1, the double pooling pyramid network E2 and the feature fusion network E3, wherein:
the initial convolutional neural network E1 comprises three residual units connected in sequence, each residual unit containing two groups of convolutional layers; the first group comprises two sequentially connected layers of 32 convolution kernels of size 3×3 with stride 2, and the second group comprises three sequentially connected layers of 32 convolution kernels of size 3×3 with stride 1;
the feature fusion network E3 comprises four groups of convolutional layers connected in parallel, each group containing two sequentially connected convolutional layers, each layer with 32 convolution kernels of size 3×3 and stride 1;
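For illustration only, the following PyTorch-style sketch shows one residual unit of the initial convolutional neural network E1 described in 2.2); the batch normalization layers, ReLU activations and the placement of the skip connection are assumptions of this sketch.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class ResidualUnit(nn.Module):
    """One residual unit of E1: a first group of two 3x3 convolutions with stride 2
    and 32 kernels, then a second group of three 3x3 convolutions with stride 1."""
    def __init__(self, in_ch, ch=32):
        super().__init__()
        self.group1 = nn.Sequential(conv_bn_relu(in_ch, ch, 2), conv_bn_relu(ch, ch, 2))
        self.group2 = nn.Sequential(*[conv_bn_relu(ch, ch, 1) for _ in range(3)])

    def forward(self, x):
        y = self.group1(x)
        return y + self.group2(y)   # residual (skip) connection, assumed
```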
2.3) Establishing a cost aggregation network R formed by cascading three stacked three-dimensional convolutional neural networks, where each three-dimensional convolutional neural network consists of three groups of convolutional layers connected in sequence, each group containing 32 three-dimensional convolution kernels of size 3×3 with stride 1;
2.4) Establishing a parallax regression network G consisting of two groups of three-dimensional convolutional layers connected in sequence, where the first group contains 32 three-dimensional convolution kernels of size 3×3 with stride 1 and the second group contains 1 three-dimensional convolution kernel of size 3×3 with stride 1;
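For illustration only, a PyTorch-style sketch of the cost aggregation network R and the parallax regression network G described in 2.3) and 2.4) follows; the 3×3×3 kernel shape, the input channel count of the cost volume and the batch normalization/ReLU layers are assumptions of this sketch.

```python
import torch.nn as nn

def conv3d_block(in_c, out_c):
    # one group of three-dimensional 3x3x3 kernels with stride 1 (BN/ReLU assumed)
    return nn.Sequential(
        nn.Conv3d(in_c, out_c, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm3d(out_c), nn.ReLU(inplace=True))

class CostAggregation(nn.Module):
    """Sketch of R: three stacked 3D CNNs, each with three groups of 32-channel convs."""
    def __init__(self, in_c=64, c=32):
        super().__init__()
        self.stacks = nn.ModuleList([
            nn.Sequential(conv3d_block(in_c if i == 0 else c, c),
                          conv3d_block(c, c), conv3d_block(c, c))
            for i in range(3)])
    def forward(self, cost):
        for stack in self.stacks:
            cost = stack(cost)
        return cost

class DisparityRegression(nn.Module):
    """Sketch of G: a group of 32 3D kernels followed by a single 3D kernel that
    compresses the channel dimension to one."""
    def __init__(self, c=32):
        super().__init__()
        self.net = nn.Sequential(conv3d_block(c, c),
                                 nn.Conv3d(c, 1, kernel_size=3, stride=1, padding=1))
    def forward(self, cost):
        return self.net(cost)       # shape: (B, 1, D, H', W')
```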
2.5 The feature extraction network E, the cost aggregation network R and the parallax regression network G are cascaded to form a binocular stereo matching network S.
Step 3, training the binocular stereo matching network S based on the double pooling pyramid.
3.1) Setting the learning rate η, the maximum number of training iterations t_max and the current iteration period t′ = 0; in this embodiment η = 0.001 and t_max = 500;
3.2) For the left eye image I_i^L and the right eye image I_i^R in the training set T, obtaining the predicted disparity map I_i^pred output by the network using forward propagation:
3.2a) Using the initial convolutional neural network E1, performing feature extraction separately on the left image I_i^L and the right image I_i^R to obtain the left image feature map F_i^L and the right image feature map F_i^R;
3.2b) Using the dual pooling pyramid network E2, performing pooling operations on the left and right eye feature maps F_i^L and F_i^R to obtain the left eye average pooling feature map group F_i^L_AVG, the left eye maximum pooling feature map group F_i^L_MAX, the right eye average pooling feature map group F_i^R_AVG and the right eye maximum pooling feature map group F_i^R_MAX, wherein:
in the average spatial pooling pyramid E2_AVG, the left and right image feature maps are average-pooled with four average pooling kernels of different sizes to extract features containing semantic information, giving the left eye average pooling feature map group F_i^L_AVG and the right eye average pooling feature map group F_i^R_AVG:
F_i^L_AVG = {F_i^L_avg1, F_i^L_avg2, F_i^L_avg3, F_i^L_avg4}
F_i^R_AVG = {F_i^R_avg1, F_i^R_avg2, F_i^R_avg3, F_i^R_avg4}
where F_i^L_avgk and F_i^R_avgk, 1 ≤ k ≤ 4, are the outputs of the k-th average pooling kernel;
in the maximum spatial pooling pyramid E2_MAX, the left and right image feature maps are max-pooled with four maximum pooling kernels of different sizes to obtain features carrying texture and edge information, giving the left eye maximum pooling feature map group F_i^L_MAX and the right eye maximum pooling feature map group F_i^R_MAX:
F_i^L_MAX = {F_i^L_max1, F_i^L_max2, F_i^L_max3, F_i^L_max4}
F_i^R_MAX = {F_i^R_max1, F_i^R_max2, F_i^R_max3, F_i^R_max4}
where F_i^L_maxk and F_i^R_maxk, 1 ≤ k ≤ 4, are the outputs of the k-th maximum pooling kernel;
3.2c) Stacking the feature maps obtained by the two pooling pyramids and passing them through the feature fusion network E3, outputting a left image multi-scale feature map and a right image multi-scale feature map;
3.2d) Setting a disparity search range D (D = 192 in this embodiment); first constructing, by splicing, a cost volume for the current disparity value for each disparity value d to obtain the three-dimensional matching cost volume group C_i = {C_i^1, C_i^2, …, C_i^d, …, C_i^D}, and then splicing the D three-dimensional matching cost volumes C_i^d in sequence to construct a four-dimensional matching cost volume;
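For illustration only, a minimal sketch of the cost volume construction in 3.2d) is given below, assuming the common splicing scheme in which the left feature map is concatenated with the right feature map shifted by each candidate disparity; the feature-resolution disparity range and tensor layout are assumptions of this sketch.

```python
import torch

def build_cost_volume(feat_left, feat_right, max_disp):
    """For each disparity d, concatenate the left feature map with the right feature
    map shifted by d pixels, then stack over d to form a four-dimensional matching
    cost volume of shape (B, 2C, D, H, W)."""
    b, c, h, w = feat_left.shape
    cost = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = feat_left
            cost[:, c:, d] = feat_right
        else:
            cost[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            cost[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return cost
```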
3.2e) Performing convolution operations on the four-dimensional matching cost volume with the cost aggregation network R to obtain the aggregated cost volume;
3.2f) Using the parallax regression network G, sequentially performing convolution and channel compression operations on the aggregated cost volume, and recovering its size by bilinear interpolation to obtain the disparity three-dimensional cost volume C_i^pred;
3.2g) Applying a soft argmin operation to the three-dimensional cost volume C_i^pred to obtain the disparity map I_i^pred predicted by the network, with the formula:

I_i^pred(p) = Σ_{d=0}^{D-1} d · softmax(−C_i^pred(d, p)),

where the softmax is taken over the disparity dimension at each pixel p.
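For illustration only, a minimal sketch of the soft argmin operation in 3.2g), assuming the cost volume has already been compressed to a single channel and upsampled to full resolution:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost, max_disp):
    """Convert a (B, D, H, W) disparity cost volume into a disparity map by taking
    the expectation of the disparity values under a softmax over the negated costs."""
    prob = F.softmax(-cost, dim=1)                               # (B, D, H, W)
    disp_values = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp_values, dim=1)                  # (B, H, W)
```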
3.3) Computing the error L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT using the SmoothL1 loss function, with the calculation formula:

L_i = (1/N_p) Σ_p smoothL1(I_i^pred(p) − I_i^GT(p)), smoothL1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise,

where p runs over the N_p labeled pixels;
3.4) Using the back-propagation gradient descent algorithm, updating the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t according to:

ω_(t+1) = ω_t − η · ∂L_i/∂ω_t
b_(t+1) = b_t − η · ∂L_i/∂b_t

where ω_(t+1) and b_(t+1) respectively denote the updated convolution kernel weight parameter and the updated convolution kernel bias parameter.
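For illustration only, a minimal PyTorch-style sketch of one training iteration covering 3.2) to 3.4) is given below; the network interface net(left, right) and the use of torch.optim.SGD for the stochastic gradient descent are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, left, right, disp_gt):
    """One training iteration: forward pass, SmoothL1 loss against the true
    disparity map, back-propagation and gradient-descent parameter update."""
    net.train()
    disp_pred = net(left, right)                      # I_i^pred
    # in practice only pixels with a valid ground-truth disparity would be
    # included in the loss (assumption, masking omitted for brevity)
    loss = F.smooth_l1_loss(disp_pred, disp_gt)       # L_i
    optimizer.zero_grad()
    loss.backward()                                   # back-propagated gradients
    optimizer.step()                                  # w_{t+1} = w_t - eta * dL/dw
    return loss.item()

# usage sketch (learning rate eta = 0.001 as in this embodiment)
# optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
# for t in range(t_max):
#     for left, right, disp_gt in train_loader:
#         train_step(net, optimizer, left, right, disp_gt)
```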
3.5) Adding 1 to the current iteration period, i.e. t′ = t′ + 1, and judging whether the maximum number of iterations has been reached:
if t′ = t_max, the maximum number of iterations has been reached, training ends, and the trained binocular stereo matching network S_t is obtained;
if t′ < t_max, returning to 3.2);
Step 4, obtaining the binocular stereo matching verification set results with the trained binocular stereo matching network.
The verification sample set V is input into the trained stereo matching network S_t of the dual-space pooling pyramid. The input first passes through the feature extraction network E, which extracts from the left eye image I_i^L and the right eye image I_i^R of the i-th sample a left image multi-scale feature map and a right image multi-scale feature map containing semantic information features and texture and edge information features. Cost volumes for the D disparity values are then constructed by splicing the two feature maps, and spliced again to construct a four-dimensional matching cost volume. Convolution operations are performed on the four-dimensional cost volume through the cost aggregation network R to obtain the aggregated cost volume; the aggregated cost volume is subjected to convolution and channel compression operations through the parallax regression network G, and its size is recovered by bilinear interpolation to obtain the disparity three-dimensional cost volume C_i^pred. Finally, a soft argmin operation is applied to the disparity three-dimensional cost volume to obtain the stereo matching result, namely the predicted disparity map I_i^pred.
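For illustration only, a minimal sketch of this verification-time forward pass, assuming the same net(left, right) interface as in the training sketch above:

```python
import torch

@torch.no_grad()
def predict_disparity(net, left, right):
    """Run the trained network S_t on one verification pair to obtain the predicted
    disparity map I_i^pred (the stereo matching result)."""
    net.eval()
    return net(left, right)

# for left, right, _ in validation_loader:      # verification set V
#     disp_pred = predict_disparity(net, left, right)
```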
The effect of the present invention is further described below with the simulation experiment:
1. Conditions of the simulation experiment
The software platform of the simulation experiment is: Ubuntu 16.04 LTS operating system, Python 3.7 programming language, PyTorch deep learning framework.
The hardware platform of the simulation experiment is: a host with an Intel i9-7700X CPU, 32 GB of memory, and an NVIDIA RTX 2080Ti graphics processor with 11 GB of video memory.
The data sets of the simulation experiment are: the 394 training image groups in KITTI2012 and KITTI2015 as the training image set T, the 200 verification images in KITTI2012 as the verification set V_1, and the 200 verification images in KITTI2015 as the verification set V_2.
In the KITTI2012 verification set V_1, the ratio of the area S_(A_e-all) of the region A_e-all, in which the predicted disparity value I_i^pred differs from the true disparity value I_i^GT by more than e pixels, to the total area S_(A_all) of the image region A_all is defined as the e-all evaluation index:

e-all = S_(A_e-all) / S_(A_all)
The ratio of the area S_(A_e-noc) of the region A_e-noc within the non-occluded disparity region A_noc whose disparity error exceeds e pixels, to the area S_(A_noc) of the non-occluded disparity region, is defined as the e-noc evaluation index:

e-noc = S_(A_e-noc) / S_(A_noc)
In this example e = 2, 3, giving four evaluation indices: the two-pixel global error 2-all, the three-pixel global error 3-all, the two-pixel non-occluded error 2-noc and the three-pixel non-occluded error 3-noc;
In the KITTI2015 verification set V_2, within a region A, the ratio of the area of the sub-region A_t, in which the error between the predicted disparity map and the true disparity map exceeds 3 pixels and exceeds 5% of the true disparity value of the point, to the area of the region A is defined as the D1 error evaluation index within region A:

D1 = S_(A_t) / S_A

where S_(A_t) is the area of the region A_t and S_A is the area of the region A. In the verification set V_2, the foreground region of the image is selected to calculate the foreground error evaluation index D1-fg, the background region to calculate the background error evaluation index D1-bg, and the whole image region to calculate the global error evaluation index D1-all.
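For illustration only, a NumPy sketch of the e-all/e-noc and D1 evaluation indices defined above; the handling of pixels without ground-truth disparity is an assumption of this sketch.

```python
import numpy as np

def e_error(disp_pred, disp_gt, e, valid_mask=None):
    """Fraction of (valid) pixels whose disparity error exceeds e pixels.
    With valid_mask covering all labeled pixels this is e-all; passing a
    non-occlusion mask instead gives the e-noc variant."""
    err = np.abs(disp_pred - disp_gt)
    if valid_mask is None:
        valid_mask = disp_gt > 0          # pixels with ground-truth disparity (assumption)
    bad = (err > e) & valid_mask
    return bad.sum() / valid_mask.sum()

def d1_error(disp_pred, disp_gt, region_mask):
    """D1 index over a region A: a pixel is an outlier if its error exceeds
    3 pixels and 5% of the true disparity value."""
    err = np.abs(disp_pred - disp_gt)
    valid = region_mask & (disp_gt > 0)
    outlier = (err > 3) & (err > 0.05 * disp_gt) & valid
    return outlier.sum() / valid.sum()
```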
2. Simulation content and analysis of results thereof
Simulation 1: taking the 394 training image groups in KITTI2012 and KITTI2015 as the training image set T, the constructed binocular stereo matching network S is trained to obtain the trained stereo matching network S_t. The 200 verification images in KITTI2012, used as the verification set V_1, are input into the trained stereo matching network S_t to obtain the stereo matching result of this method, and the two-pixel global error 2-all, three-pixel global error 3-all, two-pixel non-occluded error 2-noc and three-pixel non-occluded error 3-noc evaluation indices are calculated;
Taking the paper "Pyramid Stereo Matching Network" as the reference method, a reference network is constructed using the method in the paper and trained with the training image set T to obtain a trained reference network; the verification set V_1 is input into the trained reference network to obtain the stereo matching result of the reference method, and the two-pixel global error 2-all, three-pixel global error 3-all, two-pixel non-occluded error 2-noc and three-pixel non-occluded error 3-noc evaluation indices are calculated;
The evaluation indices obtained on the KITTI2012 verification set by the method of the invention and by the reference method are compared, and the results are shown in Table 1:
TABLE 1 comparison of the validation metrics on the KITTI2012 data set for the method of the present invention and the benchmark method
As can be seen from Table 1, on the KITTI2012 verification set V_1 the evaluation indices of the method of the invention are improved relative to those of the reference method: the 2-noc index is improved by 1.18%, the 2-all index by 0.23%, the 3-noc index by 0.07%, and the 3-all index by 0.16%.
Simulation 2: using the trained stereo matching network S_t from Simulation 1, the 200 verification images in KITTI2015 are taken as the verification set V_2 and input into the trained stereo matching network S_t. As shown in FIG. 3, FIG. 3(a) is the left eye image to be matched, FIG. 3(b) is the right eye image to be matched, and FIG. 3(c) is the output disparity map obtained by stereo matching the left and right eye images in this simulation of the invention. FIG. 3(c) shows that the invention obtains better matching results at semantic edges and in complex texture regions.
Calculating a foreground error evaluation index D1-fg, a background error evaluation index D1-bg and a global error evaluation index D1-all of a stereo matching result of the method;
Using the trained reference network from Simulation 1, the verification set V_2 is input into it to obtain the stereo matching result of the reference method, and the foreground error evaluation index D1-fg, background error evaluation index D1-bg and global error evaluation index D1-all are calculated;
The evaluation indices obtained on the KITTI2015 verification set by the method of the invention and by the reference method are compared, and the results are shown in Table 2:
TABLE 2 comparison of the validation index of the method of the invention with that of the reference method on KITTI2015 data set
As can be seen from Table 2, on the KITTI2015 verification data set V_2 the global error index D1-all is improved by 0.19%, the background error index D1-bg by 0.06%, and the foreground error index D1-fg by 0.67% compared with the reference method.
The above simulation results show that, compared with the reference method, the error evaluation indices of the method of the invention are improved on both the KITTI2012 verification data set V_1 and the KITTI2015 verification data set V_2, so the invention can effectively improve the accuracy of the stereo matching result and obtain a more accurate disparity map.

Claims (9)

1. A stereo matching method based on a double-pooling pyramid structure is characterized by comprising the following steps:
(1) Obtaining a binocular stereo matching training sample set T and a verification sample set V
Acquiring N+M groups of image pairs I_i = {I_i^L, I_i^R, I_i^GT} from an existing binocular stereo matching data set, where I_i^L and I_i^R are the left and right eye images of the i-th image pair and I_i^GT is the true disparity image of the i-th image pair;
taking N image pairs in the data set as the training sample set T = {I_1, I_2, …, I_j, …, I_N}, 1 ≤ j ≤ N, and M image pairs as the verification set V = {I_(N+1), I_(N+2), …, I_(N+k), …, I_(N+M)}, 1 ≤ k ≤ M, with N ≥ 180 and M ≥ 20;
(2) Constructing a binocular stereo matching network S based on the dual-space pooling pyramid:
(2a) Establishing a double pooling pyramid network E2 formed by connecting an average pooling pyramid E2_AVG and a maximum pooling pyramid E2_MAX in parallel;
(2b) Establishing a feature extraction network E formed by cascading an initial convolutional neural network E1, a double pooling pyramid network E2 and a feature fusion network E3;
(2c) Establishing a cost aggregation network R formed by cascading three stacked three-dimensional convolutional neural networks;
(2d) Cascading the feature extraction network E, the cost aggregation network R and the parallax regression network G to form a binocular stereo matching network S;
(3) Training a binocular stereo matching network S:
(3a) Setting the learning rate η, the maximum number of training iterations t_max, and the current iteration period t′ = 0;
(3b) For the left eye image I_i^L and the right eye image I_i^R in the training set T, obtaining the predicted disparity map I_i^pred output by the network using forward propagation;
(3c) Computing the error L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT using the SmoothL1 loss function, and updating the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t with the back-propagation gradient descent algorithm;
(3d) Adding 1 to the current iteration period, i.e. t′ = t′ + 1, and judging whether the maximum number of iterations has been reached:
if t′ = t_max, the maximum number of iterations has been reached, training ends, and the trained binocular stereo matching network S_t is obtained;
if t′ < t_max, returning to (3b);
(4) Inputting the verification sample set V into the trained stereo matching network S_t of the double spatial pooling pyramid and passing it sequentially through the feature extraction network E, the cost aggregation network R and the parallax regression network G to obtain the predicted disparity map I_i^pred, which is the result of stereo matching.
2. The method according to claim 1, wherein in (3b) the predicted disparity map I_i^pred output by the network is obtained by forward propagation for the left eye image I_i^L and the right eye image I_i^R in the training set T, implemented as follows:
(3b1) Using the initial convolutional neural network E1, performing feature extraction separately on the left image I_i^L and the right image I_i^R to obtain the left image feature map F_i^L and the right image feature map F_i^R;
(3b2) Performing pooling operations on the left and right eye feature maps with the dual pooling pyramid network E2 to obtain the left eye average pooling feature map group F_i^L_AVG, the left eye maximum pooling feature map group F_i^L_MAX, the right eye average pooling feature map group F_i^R_AVG and the right eye maximum pooling feature map group F_i^R_MAX;
(3b3) Stacking the feature maps obtained by the two pooling pyramids and passing them through the feature fusion network E3 to output a left image multi-scale feature map and a right image multi-scale feature map;
(3b4) Setting a disparity search range D; constructing, by splicing, a cost volume of the current disparity value for each disparity value d, 1 ≤ d ≤ D, to obtain the three-dimensional matching cost volume group C_i = {C_i^1, C_i^2, …, C_i^d, …, C_i^D}, and splicing the D three-dimensional matching cost volumes C_i^d in sequence to construct a four-dimensional matching cost volume;
(3b5) Performing convolution operations on the four-dimensional cost volume with the cost aggregation network R to obtain the aggregated cost volume;
(3b6) Using the parallax regression network G, sequentially performing convolution and channel compression operations on the aggregated cost volume, and recovering its size by bilinear interpolation to obtain the disparity three-dimensional cost volume C_i^pred;
(3b7) Applying a soft argmin operation to the three-dimensional cost volume C_i^pred to obtain the disparity map I_i^pred predicted by the network.
3. The method according to claim 2, wherein in (3b2) the left and right eye feature maps are pooled using the dual pooling pyramid network E2 as follows:
in the average spatial pooling pyramid E2_AVG, the left and right image feature maps are average-pooled with four average pooling kernels of different sizes to extract features containing semantic information, giving the left eye average pooling feature map group F_i^L_AVG and the right eye average pooling feature map group F_i^R_AVG:
F_i^L_AVG = {F_i^L_avg1, F_i^L_avg2, F_i^L_avg3, F_i^L_avg4}
F_i^R_AVG = {F_i^R_avg1, F_i^R_avg2, F_i^R_avg3, F_i^R_avg4}
where F_i^L_avgk and F_i^R_avgk, 1 ≤ k ≤ 4, are the outputs of the k-th average pooling kernel;
in the maximum spatial pooling pyramid E2_MAX, the left and right image feature maps are max-pooled with four maximum pooling kernels of different sizes to obtain features carrying texture and edge information, giving the left eye maximum pooling feature map group F_i^L_MAX and the right eye maximum pooling feature map group F_i^R_MAX:
F_i^L_MAX = {F_i^L_max1, F_i^L_max2, F_i^L_max3, F_i^L_max4}
F_i^R_MAX = {F_i^R_max1, F_i^R_max2, F_i^R_max3, F_i^R_max4}
where F_i^L_maxk and F_i^R_maxk, 1 ≤ k ≤ 4, are the outputs of the k-th maximum pooling kernel.
4. The method of claim 1, wherein the average pooling pyramid E2_AVG and the maximum pooling pyramid E2_MAX in (2a) are structured as follows:
the average pooling pyramid E2_AVG has four average pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size;
the maximum pooling pyramid E2_MAX has four maximum pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size.
5. The method according to claim 1, wherein the initial convolutional neural network E1 and the feature fusion network E3 in (2b) have the following structures:
the initial convolutional neural network E1 comprises three residual units connected in sequence, each residual unit containing two groups of convolutional layers; the first group comprises two sequentially connected layers of 32 convolution kernels of size 3×3 with stride 2, and the second group comprises three sequentially connected layers of 32 convolution kernels of size 3×3 with stride 1;
the feature fusion network E3 comprises four groups of convolutional layers connected in parallel, each group containing two sequentially connected convolutional layers, each layer with 32 convolution kernels of size 3×3 and stride 1.
6. The method of claim 1, wherein each three-dimensional convolutional neural network in (2c) is composed of three groups of convolutional layers connected in sequence, each group containing 32 three-dimensional convolution kernels of size 3×3 with stride 1.
7. The method according to claim 1, wherein the parallax regression network G in (2d) is composed of two groups of sequentially connected three-dimensional convolutional layers, the first group comprising 32 three-dimensional convolution kernels of size 3×3 with stride 1 and the second group comprising 1 three-dimensional convolution kernel of size 3×3 with stride 1.
8. The method according to claim 1, wherein in (3c) the loss value L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT is calculated using the SmoothL1 loss function according to:

L_i = (1/N_p) Σ_p smoothL1(I_i^pred(p) − I_i^GT(p)), smoothL1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise,

where p runs over the N_p labeled pixels.
9. The method according to claim 1, wherein in (3c) the back-propagation gradient descent algorithm is used to update the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t according to:

ω_(t+1) = ω_t − η · ∂L_i/∂ω_t
b_(t+1) = b_t − η · ∂L_i/∂b_t

where L_i is the error between the predicted disparity map I_i^pred and the true disparity map I_i^GT, η represents the network update learning rate, 0.0001 ≤ η ≤ 0.001, and ω_(t+1) and b_(t+1) respectively represent the updated convolution kernel weight parameter and the updated convolution kernel bias parameter.
CN202210336322.8A 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid Pending CN115375746A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210336322.8A CN115375746A (en) 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210336322.8A CN115375746A (en) 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid

Publications (1)

Publication Number Publication Date
CN115375746A true CN115375746A (en) 2022-11-22

Family

ID=84060238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210336322.8A Pending CN115375746A (en) 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid

Country Status (1)

Country Link
CN (1) CN115375746A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475182A (en) * 2023-09-13 2024-01-30 江南大学 Stereo matching method based on multi-feature aggregation
CN117475182B (en) * 2023-09-13 2024-06-04 江南大学 Stereo matching method based on multi-feature aggregation


Similar Documents

Publication Publication Date Title
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
Mehta et al. Structured adversarial training for unsupervised monocular depth estimation
CN103310421B (en) The quick stereo matching process right for high-definition image and disparity map acquisition methods
CN110197505B (en) Remote sensing image binocular stereo matching method based on depth network and semantic information
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN111046767B (en) 3D target detection method based on monocular image
CN111127522B (en) Depth optical flow prediction method, device, equipment and medium based on monocular camera
Li et al. ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion
CN111583313A (en) Improved binocular stereo matching method based on PSmNet
Chen et al. Depth completion using geometry-aware embedding
CN111462211B (en) Binocular parallax calculation method based on convolutional neural network
Wei et al. Bidirectional hybrid lstm based recurrent neural network for multi-view stereo
CN112862949A (en) Object 3D shape reconstruction method based on multiple views
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN114463492A (en) Adaptive channel attention three-dimensional reconstruction method based on deep learning
Zhang et al. Pa-mvsnet: Sparse-to-dense multi-view stereo with pyramid attention
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
CN113780389A (en) Deep learning semi-supervised dense matching method and system based on consistency constraint
WO2022120988A1 (en) Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution
CN111368882B (en) Stereo matching method based on simplified independent component analysis and local similarity
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN104616304A (en) Self-adapting support weight stereo matching method based on field programmable gate array (FPGA)
CN116778091A (en) Deep learning multi-view three-dimensional reconstruction algorithm based on path aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination