CN115375746A - Stereo matching method based on double-space pooling pyramid - Google Patents

Stereo matching method based on double-space pooling pyramid

Info

Publication number
CN115375746A
CN115375746A (application CN202210336322.8A)
Authority
CN
China
Prior art keywords
network
pooling
image
stereo matching
max
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210336322.8A
Other languages
Chinese (zh)
Inventor
何立火
刘晓天
唐杰浩
柯俊杰
高新波
路文
李洁
王笛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210336322.8A priority Critical patent/CN115375746A/en
Publication of CN115375746A publication Critical patent/CN115375746A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/40Analysis of texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20228Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stereo matching method based on a double-space pooling pyramid, which mainly solves the problems that multi-scale semantic information is lost and that texture information and global information of the features cannot be fully extracted in existing binocular stereo matching methods. The implementation scheme is as follows: acquiring a binocular stereo matching training sample set T and a verification sample set V; respectively establishing a feature extraction network E, a cost aggregation network R and a parallax regression network G, and cascading them to form a binocular stereo matching network S; training the binocular stereo matching network S with the training sample set T by stochastic gradient descent; and inputting the verification sample set V into the trained binocular stereo matching network S_t to obtain the stereo matching result I_i^pred. The method improves the matching precision of stereo matching at semantic edges and in texture-complex areas, effectively reduces outliers in the matching result, improves the accuracy of the stereo matching result, and can be used for disparity estimation on images and videos acquired by a binocular camera.

Description

Stereo matching method based on double-space pooling pyramid
Technical Field
The invention belongs to the technical field of computer vision and further relates to a stereo matching method that can be used for disparity estimation on images and videos acquired by a binocular camera.
Background
Binocular vision is an important direction in the field of computer vision, and the stereo matching task is one of its most important subtasks. Stereo matching refers to applying epipolar rectification to the two-dimensional binocular images acquired by a binocular camera and then obtaining the disparity values and depth information of the three-dimensional scene by matching feature points between the rectified images. The basic principle is to search for corresponding points of the binocular images in the left and right eye images and take the absolute value of the difference between the horizontal coordinates of the corresponding points in the two images as the disparity value. The three-dimensional depth information can then be obtained from the disparity value, the camera focal length and the baseline distance of the binocular camera. Stereo matching is built on the spatial geometry of the scene and has the advantages of high solving speed and low hardware cost.
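As an illustration of the relation between disparity, focal length, baseline and depth described above, a minimal Python sketch follows; the numeric values in the example are rough, assumed values and are not taken from the embodiment.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth from disparity for a rectified binocular pair:
    depth = focal_length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Example with assumed, KITTI-like values: f ~ 721 px, baseline ~ 0.54 m, disparity = 30 px
# depth = 721 * 0.54 / 30 ~ 12.98 m
```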
Traditional stereo matching methods are divided into global and semi-global methods. The global method constructs a global energy function based on a smoothness assumption and solves for a minimum of the energy function with an optimization method; its drawbacks are a large amount of computation, long running time and poor real-time performance. The semi-global method uses local information: it computes the total matching cost within a matching window of a specific size and searches for the minimum with a winner-takes-all (WTA) strategy to obtain the final disparity value. This method is strongly affected by the window size: a window that is too small may fail to contain all the texture information of an object and thus produce ambiguity, while a window that is too large produces a dilation effect in depth-discontinuous areas and increases the computation, as illustrated by the sketch below.
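For illustration only, the following NumPy sketch implements a simplified local (window-based) variant of the SAD cost computation and winner-takes-all selection described above; it is not the semi-global algorithm itself, and the window size and disparity search range are assumed values.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wta_block_matching(left, right, max_disp=64, window=9):
    """Minimal local stereo matching sketch: per-disparity absolute differences are
    aggregated over a fixed window (SAD) and the disparity with minimum cost is
    chosen by winner-takes-all (WTA)."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            diff = np.abs(left - right)
        else:
            diff = np.abs(left[:, d:] - right[:, :-d])
        # box-filter the pixel-wise cost over the matching window
        cost[d, :, d:] = uniform_filter(diff, size=window)
    return np.argmin(cost, axis=0)  # WTA disparity map (in pixels)
```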
With the development of computer technology, deep learning in the field of artificial intelligence has advanced rapidly. Compared with traditional stereo matching methods, stereo matching methods based on deep learning have stronger feature expression and feature learning capability and the advantage of parallelization. A deep-learning stereo matching method builds the stereo matching network from convolutional neural networks, trains it continuously with back-propagation-based gradient updates, and can then complete stereo matching in a single end-to-end pass.
Beijing Institute of Technology proposes 'A neighbor propagation stereo matching method based on image pyramid distance measurement' in the patent documents with application number 202110550638.2 and publication number CN113177565A, with the following implementation steps: 1) extracting features of the stereo image pair consisting of the left and right images and pairing them using a distance metric; 2) repeatedly downsampling the images by a factor of two and matching the downsampled images; 3) selecting correctly matched pixel-point pairs and propagating the matching result to neighbouring pixels. This method searches for matching points using constraints constructed from feature matching, but because the image pyramid is constructed and features are obtained only from the input left and right target images, deeper semantic and texture information of the images cannot be fully extracted, and the precision of the resulting matching is poor.
Shandong University proposes 'A stereo matching feature extraction and disparity map post-processing method' in the patent documents with application number 202110281313.9 and publication number CN112991420A, with the following implementation steps: 1) extracting stereo matching features and pyramid features with convolution kernels; 2) aggregating the features output by the convolutional network with a 4-path cost aggregation method; 3) performing internal left-right consistency detection on the disparity map and filling invalid disparity points to obtain the final disparity map. This method has the following defects: the convolutional neural network is used only in the feature extraction part and no end-to-end network is constructed for learning and training, which hinders further learning of disparity and increases the training difficulty of the model; meanwhile, the cost aggregation algorithm used is complex and not easy to parallelize or accelerate, so the finally output disparity map has low precision and the method is time-consuming.
The patent documents of Beijing Institute of Technology with application number 202110550638.2 and publication number CN113177565A disclose 'A binocular vision position measuring system and method based on deep learning' for binocular vision position measurement. The system comprises an image capturing module, a deep-learning object recognition module, an image segmentation module, a fitting module and a binocular point cloud module. The implementation steps are: 1) building a disparity calculation submodule with a convolutional neural network to obtain a disparity map referenced to the left camera of the binocular pair; 2) using a point cloud computation submodule to obtain the three-dimensional point cloud of the scene from the disparity map and the intrinsic and extrinsic parameters of the binocular camera. Because a semi-global matching method is used during disparity aggregation, the obtained disparity map tends to contain many outliers, which leads to more noise in the final output.
Jia-Ren Chang et al., in the paper "Pyramid Stereo Matching Network" (Conference on Computer Vision and Pattern Recognition, 2018), propose an end-to-end binocular stereo matching network. The network first performs feature extraction on the binocular images with a convolutional neural network containing a spatial pooling pyramid, then constructs a four-dimensional matching cost volume, and learns and aggregates the cost volume with stacked three-dimensional convolutional neural networks to finally obtain the disparity map. Because the feature extraction network uses only a spatial average pooling pyramid to extract averaged features over multiple scales, edge and texture information in the original image and feature maps cannot be preserved, which easily causes matching failures between the left and right eye images, so the matching effect at semantic edges and in complex texture areas is poor.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a stereo matching method based on a double-space pooling pyramid, so that the feature information of the image is fully extracted, the matching effect at semantic edges and in texture-complex areas is improved, outliers in the matching result are reduced, the accuracy of the final stereo matching is improved, and a more accurate disparity map is obtained.
The technical key point of the invention is to improve the precision of the final output disparity map by using a double pooling pyramid structure in the feature extraction module of the end-to-end binocular stereo matching network. The implementation scheme comprises the following steps:
(1) Obtaining a binocular stereo matching training sample set T and a verification sample set V
Acquiring N+M groups of image pairs I_i = {I_i^L, I_i^R, I_i^GT} from an existing binocular stereo matching data set, where I_i^L and I_i^R are the left and right eye images of the i-th image pair and I_i^GT is the true disparity image of the i-th image pair;
Taking N image pairs in the data set as the training sample set T = {I_1, I_2, …, I_j, …, I_N}, 1 ≤ j ≤ N, and M image pairs as the verification set V = {I_(N+1), I_(N+2), …, I_(N+k), …, I_(N+M)}, 1 ≤ k ≤ M, with N ≥ 180 and M ≥ 20;
(2) Constructing a binocular stereo matching network S based on the dual-space pooling pyramid:
(2a) Establishing a double pooling pyramid network E2 formed by connecting an average pooling pyramid E2_AVG and a maximum pooling pyramid E2_MAX in parallel;
(2b) Establishing a feature extraction network E formed by cascading an initial convolutional neural network E1, a double pooling pyramid network E2 and a feature fusion network E3;
(2c) Establishing a cost aggregation network R formed by cascading three stacked three-dimensional convolutional neural networks;
(2d) Cascading the feature extraction network E, the cost aggregation network R and the parallax regression network G to form a binocular stereo matching network S;
(3) Training a binocular stereo matching network S:
(3a) Setting the learning rate η, the maximum number of training iterations t_max, and the current iteration period t′ = 0;
(3b) For the left eye image I_i^L and the right eye image I_i^R in the training set T, obtaining the predicted disparity map I_i^pred output by the network using forward propagation;
(3c) Computing the error L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT using the SmoothL1 loss function, and updating the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t with the back-propagation gradient descent algorithm;
(3d) Adding 1 to the current iteration period, i.e. t′ = t′ + 1, and judging whether the maximum number of iterations has been reached:
if t′ = t_max, the maximum number of iterations has been reached, training ends, and the trained binocular stereo matching network S_t is obtained;
if t′ < t_max, returning to (3b);
(4) Inputting the verification sample set V into the trained stereo matching network S_t of the double spatial pooling pyramid and passing it sequentially through the feature extraction network E, the cost aggregation network R and the parallax regression network G to obtain the predicted disparity map I_i^pred, which is the result of stereo matching.
Compared with the prior art, the invention has the following advantages:
1. The invention uses the double pooling pyramid network E2, which provides the binocular stereo matching network with multi-scale features rich in texture information and global semantic information, improving the precision of the stereo matching result at semantic edges, in texture-complex areas and in texture-missing areas.
2. The invention uses the feature extraction network E to extract and further fuse the features of the left and right eye images, improving the accuracy of the final stereo matching result and effectively reducing erroneous outliers in the matching result.
3. The invention uses the cost aggregation network R and the parallax regression network G to match the binocular images and output the predicted disparity map.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a pooling pyramid structure in the present invention;
FIG. 3 is an output disparity map obtained by stereo matching left and right eye images in the simulation of the present invention.
Detailed Description
Embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, acquiring a binocular stereo matching training sample set T and a verification sample set V.
The KITTI data set is obtained. This data set was established by Geiger et al. at the Karlsruhe Institute of Technology and was captured in three kinds of scenes (urban, rural and highway) using a car equipped with two high-definition color cameras, a Velodyne lidar and a GPS. The data set provides binocular images, disparity maps, optical flow maps and other information, and can be used for tasks such as stereo matching, optical flow prediction, 3D object detection and 3D tracking. It is divided into two sub-data sets, KITTI2012 and KITTI2015. The KITTI2012 sub-data set provides 194 training image groups containing binocular images and true disparity values, and 195 verification image groups containing only binocular images, whose true disparity values are not published; KITTI2015 extends this with 200 additional training image groups and 200 additional verification image groups.
N+M groups of image pairs I_i = {I_i^L, I_i^R, I_i^GT} with resolution W×H are obtained from the KITTI binocular stereo matching data set, where I_i^L and I_i^R are the left and right eye images of the i-th image pair and I_i^GT is the true disparity image of the i-th image pair; in this embodiment W = 512 and H = 256.
N = 394 image pairs in the data set are taken as the training sample set T = {I_1, I_2, …, I_j, …, I_N}, 1 ≤ j ≤ N;
M = 200 image pairs are taken as the verification set V = {I_(N+1), I_(N+2), …, I_(N+k), …, I_(N+M)}, 1 ≤ k ≤ M, N ≥ 180, M ≥ 20.
A total of 394 training image groups in KITTI2012 and KITTI2015 are taken as the training image set T,
the 200 verification images in KITTI2012 are taken as the verification set V_1,
and the 200 verification images in KITTI2015 are taken as the verification set V_2.
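For illustration only, the sample sets could be wrapped as PyTorch datasets as follows; the directory names and file organization are assumptions of this sketch, not part of the embodiment.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class KittiStereoSet(Dataset):
    """Sketch of a binocular stereo matching sample set: each item is an image pair
    I_i = (left, right, ground-truth disparity). Directory names are assumptions."""
    def __init__(self, root, file_list, transform=None):
        self.root = root
        self.names = file_list          # frame identifiers belonging to this split
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        left = Image.open(os.path.join(self.root, 'image_2', self.names[i]))
        right = Image.open(os.path.join(self.root, 'image_3', self.names[i]))
        disp_gt = Image.open(os.path.join(self.root, 'disp_occ_0', self.names[i]))
        sample = (left, right, disp_gt)
        return self.transform(sample) if self.transform else sample

# Training set T: the 394 KITTI2012 + KITTI2015 training pairs;
# verification sets V_1 / V_2: the 200 KITTI2012 and 200 KITTI2015 verification pairs.
```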
Step 2, constructing a stereo matching network S based on the double spatial pooling pyramid.
2.1) With reference to FIG. 2, a double pooling pyramid network E2 is built by connecting an average pooling pyramid E2_AVG and a maximum pooling pyramid E2_MAX in parallel, wherein:
the average pooling pyramid E2_AVG has four average pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size;
the maximum pooling pyramid E2_MAX has four maximum pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size;
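For illustration only, a minimal PyTorch-style sketch of the double pooling pyramid E2 described in 2.1) is given below; the 1×1 convolutions after each pooling branch and the bilinear upsampling back to the input resolution are assumptions of this sketch, not limitations of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPoolingPyramid(nn.Module):
    """Sketch of the double pooling pyramid E2: parallel average- and max-pooling
    branches at scales 4, 8, 16, 64, each reduced to 32 channels (assumption)."""
    def __init__(self, in_channels=32, branch_channels=32, scales=(4, 8, 16, 64)):
        super().__init__()
        self.scales = scales
        self.avg_convs = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False)
            for _ in scales)
        self.max_convs = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False)
            for _ in scales)

    def forward(self, x):
        h, w = x.shape[2:]
        outputs = []
        for s, conv in zip(self.scales, self.avg_convs):
            y = F.avg_pool2d(x, kernel_size=s, stride=s)      # E2_AVG branch
            outputs.append(F.interpolate(conv(y), size=(h, w),
                                         mode='bilinear', align_corners=False))
        for s, conv in zip(self.scales, self.max_convs):
            y = F.max_pool2d(x, kernel_size=s, stride=s)      # E2_MAX branch
            outputs.append(F.interpolate(conv(y), size=(h, w),
                                         mode='bilinear', align_corners=False))
        # stack all pooled feature maps with the input for later fusion by E3
        return torch.cat([x] + outputs, dim=1)
```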
2.2) Establishing a feature extraction network E formed by cascading the initial convolutional neural network E1, the double pooling pyramid network E2 and the feature fusion network E3, wherein:
the initial convolutional neural network E1 comprises three residual units connected in sequence, each residual unit containing two groups of convolutional layers; the first group comprises two sequentially connected layers of 32 convolution kernels of size 3×3 with stride 2, and the second group comprises three sequentially connected layers of 32 convolution kernels of size 3×3 with stride 1;
the feature fusion network E3 comprises four groups of convolutional layers connected in parallel, each group containing two sequentially connected convolutional layers, each layer with 32 convolution kernels of size 3×3 and stride 1;
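For illustration only, the following PyTorch-style sketch shows one residual unit of the initial convolutional neural network E1 described in 2.2); the batch normalization layers, ReLU activations and the placement of the skip connection are assumptions of this sketch.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class ResidualUnit(nn.Module):
    """One residual unit of E1: a first group of two 3x3 convolutions with stride 2
    and 32 kernels, then a second group of three 3x3 convolutions with stride 1."""
    def __init__(self, in_ch, ch=32):
        super().__init__()
        self.group1 = nn.Sequential(conv_bn_relu(in_ch, ch, 2), conv_bn_relu(ch, ch, 2))
        self.group2 = nn.Sequential(*[conv_bn_relu(ch, ch, 1) for _ in range(3)])

    def forward(self, x):
        y = self.group1(x)
        return y + self.group2(y)   # residual (skip) connection, assumed
```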
2.3) Establishing a cost aggregation network R formed by cascading three stacked three-dimensional convolutional neural networks, where each three-dimensional convolutional neural network consists of three groups of convolutional layers connected in sequence, each group containing 32 three-dimensional convolution kernels of size 3×3 with stride 1;
2.4) Establishing a parallax regression network G consisting of two groups of three-dimensional convolutional layers connected in sequence, where the first group contains 32 three-dimensional convolution kernels of size 3×3 with stride 1 and the second group contains 1 three-dimensional convolution kernel of size 3×3 with stride 1;
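For illustration only, a PyTorch-style sketch of the cost aggregation network R and the parallax regression network G described in 2.3) and 2.4) follows; the 3×3×3 kernel shape, the input channel count of the cost volume and the batch normalization/ReLU layers are assumptions of this sketch.

```python
import torch.nn as nn

def conv3d_block(in_c, out_c):
    # one group of three-dimensional 3x3x3 kernels with stride 1 (BN/ReLU assumed)
    return nn.Sequential(
        nn.Conv3d(in_c, out_c, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm3d(out_c), nn.ReLU(inplace=True))

class CostAggregation(nn.Module):
    """Sketch of R: three stacked 3D CNNs, each with three groups of 32-channel convs."""
    def __init__(self, in_c=64, c=32):
        super().__init__()
        self.stacks = nn.ModuleList([
            nn.Sequential(conv3d_block(in_c if i == 0 else c, c),
                          conv3d_block(c, c), conv3d_block(c, c))
            for i in range(3)])
    def forward(self, cost):
        for stack in self.stacks:
            cost = stack(cost)
        return cost

class DisparityRegression(nn.Module):
    """Sketch of G: a group of 32 3D kernels followed by a single 3D kernel that
    compresses the channel dimension to one."""
    def __init__(self, c=32):
        super().__init__()
        self.net = nn.Sequential(conv3d_block(c, c),
                                 nn.Conv3d(c, 1, kernel_size=3, stride=1, padding=1))
    def forward(self, cost):
        return self.net(cost)       # shape: (B, 1, D, H', W')
```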
2.5 The feature extraction network E, the cost aggregation network R and the parallax regression network G are cascaded to form a binocular stereo matching network S.
Step 3, training the binocular stereo matching network S based on the double pooling pyramid.
3.1) Setting the learning rate η, the maximum number of training iterations t_max and the current iteration period t′ = 0; in this embodiment η = 0.001 and t_max = 500;
3.2) For the left eye image I_i^L and the right eye image I_i^R in the training set T, obtaining the predicted disparity map I_i^pred output by the network using forward propagation:
3.2a) Using the initial convolutional neural network E1, performing feature extraction separately on the left image I_i^L and the right image I_i^R to obtain the left image feature map F_i^L and the right image feature map F_i^R;
3.2b) Using the dual pooling pyramid network E2, performing pooling operations on the left and right eye feature maps F_i^L and F_i^R to obtain the left eye average pooling feature map group F_i^L_AVG, the left eye maximum pooling feature map group F_i^L_MAX, the right eye average pooling feature map group F_i^R_AVG and the right eye maximum pooling feature map group F_i^R_MAX, wherein:
in the average spatial pooling pyramid E2_AVG, the left and right image feature maps are average-pooled with four average pooling kernels of different sizes to extract features containing semantic information, giving the left eye average pooling feature map group F_i^L_AVG and the right eye average pooling feature map group F_i^R_AVG:
F_i^L_AVG = {F_i^L_avg1, F_i^L_avg2, F_i^L_avg3, F_i^L_avg4}
F_i^R_AVG = {F_i^R_avg1, F_i^R_avg2, F_i^R_avg3, F_i^R_avg4}
where F_i^L_avgk and F_i^R_avgk, 1 ≤ k ≤ 4, are the outputs of the k-th average pooling kernel;
in the maximum spatial pooling pyramid E2_MAX, the left and right image feature maps are max-pooled with four maximum pooling kernels of different sizes to obtain features carrying texture and edge information, giving the left eye maximum pooling feature map group F_i^L_MAX and the right eye maximum pooling feature map group F_i^R_MAX:
F_i^L_MAX = {F_i^L_max1, F_i^L_max2, F_i^L_max3, F_i^L_max4}
F_i^R_MAX = {F_i^R_max1, F_i^R_max2, F_i^R_max3, F_i^R_max4}
where F_i^L_maxk and F_i^R_maxk, 1 ≤ k ≤ 4, are the outputs of the k-th maximum pooling kernel;
3.2c) Stacking the feature maps obtained by the two pooling pyramids and passing them through the feature fusion network E3, outputting a left image multi-scale feature map and a right image multi-scale feature map;
3.2d) Setting a disparity search range D (D = 192 in this embodiment); first constructing, by splicing, a cost volume for the current disparity value for each disparity value d to obtain the three-dimensional matching cost volume group C_i = {C_i^1, C_i^2, …, C_i^d, …, C_i^D}, and then splicing the D three-dimensional matching cost volumes C_i^d in sequence to construct a four-dimensional matching cost volume;
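For illustration only, a minimal sketch of the cost volume construction in 3.2d) is given below, assuming the common splicing scheme in which the left feature map is concatenated with the right feature map shifted by each candidate disparity; the feature-resolution disparity range and tensor layout are assumptions of this sketch.

```python
import torch

def build_cost_volume(feat_left, feat_right, max_disp):
    """For each disparity d, concatenate the left feature map with the right feature
    map shifted by d pixels, then stack over d to form a four-dimensional matching
    cost volume of shape (B, 2C, D, H, W)."""
    b, c, h, w = feat_left.shape
    cost = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = feat_left
            cost[:, c:, d] = feat_right
        else:
            cost[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            cost[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return cost
```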
3.2e) Performing convolution operations on the four-dimensional matching cost volume with the cost aggregation network R to obtain the aggregated cost volume;
3.2f) Using the parallax regression network G, sequentially performing convolution and channel compression operations on the aggregated cost volume, and recovering its size by bilinear interpolation to obtain the disparity three-dimensional cost volume C_i^pred;
3.2g) Applying a soft argmin operation to the three-dimensional cost volume C_i^pred to obtain the disparity map I_i^pred predicted by the network, with the formula:

I_i^pred(p) = Σ_{d=0}^{D-1} d · softmax(−C_i^pred(d, p)),

where the softmax is taken over the disparity dimension at each pixel p.
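For illustration only, a minimal sketch of the soft argmin operation in 3.2g), assuming the cost volume has already been compressed to a single channel and upsampled to full resolution:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost, max_disp):
    """Convert a (B, D, H, W) disparity cost volume into a disparity map by taking
    the expectation of the disparity values under a softmax over the negated costs."""
    prob = F.softmax(-cost, dim=1)                               # (B, D, H, W)
    disp_values = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp_values, dim=1)                  # (B, H, W)
```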
3.3) Computing the error L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT using the SmoothL1 loss function, with the calculation formula:

L_i = (1/N_p) Σ_p smoothL1(I_i^pred(p) − I_i^GT(p)), smoothL1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise,

where p runs over the N_p labeled pixels;
3.4) Using the back-propagation gradient descent algorithm, updating the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t according to:

ω_(t+1) = ω_t − η · ∂L_i/∂ω_t
b_(t+1) = b_t − η · ∂L_i/∂b_t

where ω_(t+1) and b_(t+1) respectively denote the updated convolution kernel weight parameter and the updated convolution kernel bias parameter.
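For illustration only, a minimal PyTorch-style sketch of one training iteration covering 3.2) to 3.4) is given below; the network interface net(left, right) and the use of torch.optim.SGD for the stochastic gradient descent are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, left, right, disp_gt):
    """One training iteration: forward pass, SmoothL1 loss against the true
    disparity map, back-propagation and gradient-descent parameter update."""
    net.train()
    disp_pred = net(left, right)                      # I_i^pred
    # in practice only pixels with a valid ground-truth disparity would be
    # included in the loss (assumption, masking omitted for brevity)
    loss = F.smooth_l1_loss(disp_pred, disp_gt)       # L_i
    optimizer.zero_grad()
    loss.backward()                                   # back-propagated gradients
    optimizer.step()                                  # w_{t+1} = w_t - eta * dL/dw
    return loss.item()

# usage sketch (learning rate eta = 0.001 as in this embodiment)
# optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
# for t in range(t_max):
#     for left, right, disp_gt in train_loader:
#         train_step(net, optimizer, left, right, disp_gt)
```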
3.5) Adding 1 to the current iteration period, i.e. t′ = t′ + 1, and judging whether the maximum number of iterations has been reached:
if t′ = t_max, the maximum number of iterations has been reached, training ends, and the trained binocular stereo matching network S_t is obtained;
if t′ < t_max, returning to 3.2);
Step 4, obtaining the binocular stereo matching verification set results with the trained binocular stereo matching network.
The verification sample set V is input into the trained stereo matching network S_t of the dual-space pooling pyramid. The input first passes through the feature extraction network E, which extracts from the left eye image I_i^L and the right eye image I_i^R of the i-th sample a left image multi-scale feature map and a right image multi-scale feature map containing semantic information features and texture and edge information features. Cost volumes for the D disparity values are then constructed by splicing the two feature maps, and spliced again to construct a four-dimensional matching cost volume. Convolution operations are performed on the four-dimensional cost volume through the cost aggregation network R to obtain the aggregated cost volume; the aggregated cost volume is subjected to convolution and channel compression operations through the parallax regression network G, and its size is recovered by bilinear interpolation to obtain the disparity three-dimensional cost volume C_i^pred. Finally, a soft argmin operation is applied to the disparity three-dimensional cost volume to obtain the stereo matching result, namely the predicted disparity map I_i^pred.
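For illustration only, a minimal sketch of this verification-time forward pass, assuming the same net(left, right) interface as in the training sketch above:

```python
import torch

@torch.no_grad()
def predict_disparity(net, left, right):
    """Run the trained network S_t on one verification pair to obtain the predicted
    disparity map I_i^pred (the stereo matching result)."""
    net.eval()
    return net(left, right)

# for left, right, _ in validation_loader:      # verification set V
#     disp_pred = predict_disparity(net, left, right)
```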
The effect of the present invention is further described below with the simulation experiment:
1. Conditions of the simulation experiment
The software platform of the simulation experiment is: Ubuntu 16.04 LTS operating system, Python 3.7 programming language, PyTorch deep learning framework.
The hardware platform of the simulation experiment is: a host with an Intel i9-7700X CPU, 32 GB of memory, and an NVIDIA RTX 2080Ti graphics processor with 11 GB of video memory.
The data sets of the simulation experiment are: the 394 training image groups in KITTI2012 and KITTI2015 as the training image set T, the 200 verification images in KITTI2012 as the verification set V_1, and the 200 verification images in KITTI2015 as the verification set V_2.
In the KITTI2012 verification set V_1, the ratio of the area S_(A_e-all) of the region A_e-all, in which the predicted disparity value I_i^pred differs from the true disparity value I_i^GT by more than e pixels, to the total area S_(A_all) of the image region A_all is defined as the e-all evaluation index:

e-all = S_(A_e-all) / S_(A_all)
The ratio of the area S_(A_e-noc) of the region A_e-noc within the non-occluded disparity region A_noc whose disparity error exceeds e pixels, to the area S_(A_noc) of the non-occluded disparity region, is defined as the e-noc evaluation index:

e-noc = S_(A_e-noc) / S_(A_noc)
In this example e = 2, 3, giving four evaluation indices: the two-pixel global error 2-all, the three-pixel global error 3-all, the two-pixel non-occluded error 2-noc and the three-pixel non-occluded error 3-noc;
In the KITTI2015 verification set V_2, within a region A, the ratio of the area of the sub-region A_t, in which the error between the predicted disparity map and the true disparity map exceeds 3 pixels and exceeds 5% of the true disparity value of the point, to the area of the region A is defined as the D1 error evaluation index within region A:

D1 = S_(A_t) / S_A

where S_(A_t) is the area of the region A_t and S_A is the area of the region A. In the verification set V_2, the foreground region of the image is selected to calculate the foreground error evaluation index D1-fg, the background region to calculate the background error evaluation index D1-bg, and the whole image region to calculate the global error evaluation index D1-all.
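For illustration only, a NumPy sketch of the e-all/e-noc and D1 evaluation indices defined above; the handling of pixels without ground-truth disparity is an assumption of this sketch.

```python
import numpy as np

def e_error(disp_pred, disp_gt, e, valid_mask=None):
    """Fraction of (valid) pixels whose disparity error exceeds e pixels.
    With valid_mask covering all labeled pixels this is e-all; passing a
    non-occlusion mask instead gives the e-noc variant."""
    err = np.abs(disp_pred - disp_gt)
    if valid_mask is None:
        valid_mask = disp_gt > 0          # pixels with ground-truth disparity (assumption)
    bad = (err > e) & valid_mask
    return bad.sum() / valid_mask.sum()

def d1_error(disp_pred, disp_gt, region_mask):
    """D1 index over a region A: a pixel is an outlier if its error exceeds
    3 pixels and 5% of the true disparity value."""
    err = np.abs(disp_pred - disp_gt)
    valid = region_mask & (disp_gt > 0)
    outlier = (err > 3) & (err > 0.05 * disp_gt) & valid
    return outlier.sum() / valid.sum()
```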
2. Simulation content and analysis of results thereof
Simulation 1: taking the 394 training image groups in KITTI2012 and KITTI2015 as the training image set T, the constructed binocular stereo matching network S is trained to obtain the trained stereo matching network S_t. The 200 verification images in KITTI2012, used as the verification set V_1, are input into the trained stereo matching network S_t to obtain the stereo matching result of this method, and the two-pixel global error 2-all, three-pixel global error 3-all, two-pixel non-occluded error 2-noc and three-pixel non-occluded error 3-noc evaluation indices are calculated;
Taking the paper "Pyramid Stereo Matching Network" as the reference method, a reference network is constructed using the method in the paper and trained with the training image set T to obtain a trained reference network; the verification set V_1 is input into the trained reference network to obtain the stereo matching result of the reference method, and the two-pixel global error 2-all, three-pixel global error 3-all, two-pixel non-occluded error 2-noc and three-pixel non-occluded error 3-noc evaluation indices are calculated;
The evaluation indices obtained on the KITTI2012 verification set by the method of the invention and by the reference method are compared, and the results are shown in Table 1:
TABLE 1 comparison of the validation metrics on the KITTI2012 data set for the method of the present invention and the benchmark method
As can be seen from Table 1, on the KITTI2012 verification set V_1 the evaluation indices of the method of the invention are improved relative to those of the reference method: the 2-noc index is improved by 1.18%, the 2-all index by 0.23%, the 3-noc index by 0.07%, and the 3-all index by 0.16%.
Simulation 2: using the trained stereo matching network S_t from Simulation 1, the 200 verification images in KITTI2015 are taken as the verification set V_2 and input into the trained stereo matching network S_t. As shown in FIG. 3, FIG. 3(a) is the left eye image to be matched, FIG. 3(b) is the right eye image to be matched, and FIG. 3(c) is the output disparity map obtained by stereo matching the left and right eye images in this simulation of the invention. FIG. 3(c) shows that the invention obtains better matching results at semantic edges and in complex texture regions.
Calculating a foreground error evaluation index D1-fg, a background error evaluation index D1-bg and a global error evaluation index D1-all of a stereo matching result of the method;
Using the trained reference network from Simulation 1, the verification set V_2 is input into it to obtain the stereo matching result of the reference method, and the foreground error evaluation index D1-fg, background error evaluation index D1-bg and global error evaluation index D1-all are calculated;
The evaluation indices obtained on the KITTI2015 verification set by the method of the invention and by the reference method are compared, and the results are shown in Table 2:
TABLE 2 comparison of the validation index of the method of the invention with that of the reference method on KITTI2015 data set
As can be seen from Table 2, on the KITTI2015 verification data set V_2 the global error index D1-all is improved by 0.19%, the background error index D1-bg by 0.06%, and the foreground error index D1-fg by 0.67% compared with the reference method.
The above simulation results show that, compared with the reference method, the error evaluation indices of the method of the invention are improved on both the KITTI2012 verification data set V_1 and the KITTI2015 verification data set V_2, so the invention can effectively improve the accuracy of the stereo matching result and obtain a more accurate disparity map.

Claims (9)

1. A stereo matching method based on a double-pooling pyramid structure is characterized by comprising the following steps:
(1) Obtaining a binocular stereo matching training sample set T and a verification sample set V
Acquiring N+M groups of image pairs I_i = {I_i^L, I_i^R, I_i^GT} from an existing binocular stereo matching data set, where I_i^L and I_i^R are the left and right eye images of the i-th image pair and I_i^GT is the true disparity image of the i-th image pair;
taking N image pairs in the data set as the training sample set T = {I_1, I_2, …, I_j, …, I_N}, 1 ≤ j ≤ N, and M image pairs as the verification set V = {I_(N+1), I_(N+2), …, I_(N+k), …, I_(N+M)}, 1 ≤ k ≤ M, with N ≥ 180 and M ≥ 20;
(2) Constructing a binocular stereo matching network S based on the dual-space pooling pyramid:
(2a) Establishing a double pooling pyramid network E2 formed by connecting an average pooling pyramid E2_AVG and a maximum pooling pyramid E2_MAX in parallel;
(2b) Establishing a feature extraction network E formed by cascading an initial convolutional neural network E1, a double pooling pyramid network E2 and a feature fusion network E3;
(2c) Establishing a cost aggregation network R formed by cascading three stacked three-dimensional convolutional neural networks;
(2d) Cascading the feature extraction network E, the cost aggregation network R and the parallax regression network G to form a binocular stereo matching network S;
(3) Training a binocular stereo matching network S:
(3a) Setting the learning rate η, the maximum number of training iterations t_max, and the current iteration period t′ = 0;
(3b) For the left eye image I_i^L and the right eye image I_i^R in the training set T, obtaining the predicted disparity map I_i^pred output by the network using forward propagation;
(3c) Computing the error L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT using the SmoothL1 loss function, and updating the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t with the back-propagation gradient descent algorithm;
(3d) Adding 1 to the current iteration period, i.e. t′ = t′ + 1, and judging whether the maximum number of iterations has been reached:
if t′ = t_max, the maximum number of iterations has been reached, training ends, and the trained binocular stereo matching network S_t is obtained;
if t′ < t_max, returning to (3b);
(4) Inputting the verification sample set V into the trained stereo matching network S_t of the double spatial pooling pyramid and passing it sequentially through the feature extraction network E, the cost aggregation network R and the parallax regression network G to obtain the predicted disparity map I_i^pred, which is the result of stereo matching.
2. The method according to claim 1, wherein in (3b) the predicted disparity map I_i^pred output by the network is obtained by forward propagation for the left eye image I_i^L and the right eye image I_i^R in the training set T, implemented as follows:
(3b1) Using the initial convolutional neural network E1, performing feature extraction separately on the left image I_i^L and the right image I_i^R to obtain the left image feature map F_i^L and the right image feature map F_i^R;
(3b2) Performing pooling operations on the left and right eye feature maps with the dual pooling pyramid network E2 to obtain the left eye average pooling feature map group F_i^L_AVG, the left eye maximum pooling feature map group F_i^L_MAX, the right eye average pooling feature map group F_i^R_AVG and the right eye maximum pooling feature map group F_i^R_MAX;
(3b3) Stacking the feature maps obtained by the two pooling pyramids and passing them through the feature fusion network E3 to output a left image multi-scale feature map and a right image multi-scale feature map;
(3b4) Setting a disparity search range D; constructing, by splicing, a cost volume of the current disparity value for each disparity value d, 1 ≤ d ≤ D, to obtain the three-dimensional matching cost volume group C_i = {C_i^1, C_i^2, …, C_i^d, …, C_i^D}, and splicing the D three-dimensional matching cost volumes C_i^d in sequence to construct a four-dimensional matching cost volume;
(3b5) Performing convolution operations on the four-dimensional cost volume with the cost aggregation network R to obtain the aggregated cost volume;
(3b6) Using the parallax regression network G, sequentially performing convolution and channel compression operations on the aggregated cost volume, and recovering its size by bilinear interpolation to obtain the disparity three-dimensional cost volume C_i^pred;
(3b7) Applying a soft argmin operation to the three-dimensional cost volume C_i^pred to obtain the disparity map I_i^pred predicted by the network.
3. The method according to claim 2, wherein in (3b2) the left and right eye feature maps are pooled using the dual pooling pyramid network E2 as follows:
in the average spatial pooling pyramid E2_AVG, the left and right image feature maps are average-pooled with four average pooling kernels of different sizes to extract features containing semantic information, giving the left eye average pooling feature map group F_i^L_AVG and the right eye average pooling feature map group F_i^R_AVG:
F_i^L_AVG = {F_i^L_avg1, F_i^L_avg2, F_i^L_avg3, F_i^L_avg4}
F_i^R_AVG = {F_i^R_avg1, F_i^R_avg2, F_i^R_avg3, F_i^R_avg4}
where F_i^L_avgk and F_i^R_avgk, 1 ≤ k ≤ 4, are the outputs of the k-th average pooling kernel;
in the maximum spatial pooling pyramid E2_MAX, the left and right image feature maps are max-pooled with four maximum pooling kernels of different sizes to obtain features carrying texture and edge information, giving the left eye maximum pooling feature map group F_i^L_MAX and the right eye maximum pooling feature map group F_i^R_MAX:
F_i^L_MAX = {F_i^L_max1, F_i^L_max2, F_i^L_max3, F_i^L_max4}
F_i^R_MAX = {F_i^R_max1, F_i^R_max2, F_i^R_max3, F_i^R_max4}
where F_i^L_maxk and F_i^R_maxk, 1 ≤ k ≤ 4, are the outputs of the k-th maximum pooling kernel.
4. The method of claim 1, wherein the average pooling pyramid E2_AVG and the maximum pooling pyramid E2_MAX in (2a) are structured as follows:
the average pooling pyramid E2_AVG has four average pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size;
the maximum pooling pyramid E2_MAX has four maximum pooling kernels of sizes 4×4, 8×8, 16×16 and 64×64, with 32 kernels of each size.
5. The method according to claim 1, wherein the initial convolutional neural network E1 and the feature fusion network E3 in (2b) have the following structures:
the initial convolutional neural network E1 comprises three residual units connected in sequence, each residual unit containing two groups of convolutional layers; the first group comprises two sequentially connected layers of 32 convolution kernels of size 3×3 with stride 2, and the second group comprises three sequentially connected layers of 32 convolution kernels of size 3×3 with stride 1;
the feature fusion network E3 comprises four groups of convolutional layers connected in parallel, each group containing two sequentially connected convolutional layers, each layer with 32 convolution kernels of size 3×3 and stride 1.
6. The method of claim 1, wherein each three-dimensional convolutional neural network in (2c) is composed of three groups of convolutional layers connected in sequence, each group containing 32 three-dimensional convolution kernels of size 3×3 with stride 1.
7. The method according to claim 1, wherein the parallax regression network G in (2d) is composed of two groups of sequentially connected three-dimensional convolutional layers, the first group comprising 32 three-dimensional convolution kernels of size 3×3 with stride 1 and the second group comprising 1 three-dimensional convolution kernel of size 3×3 with stride 1.
8. The method according to claim 1, wherein in (3c) the loss value L_i between the predicted disparity map I_i^pred and the true disparity map I_i^GT is calculated using the SmoothL1 loss function according to:

L_i = (1/N_p) Σ_p smoothL1(I_i^pred(p) − I_i^GT(p)), smoothL1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise,

where p runs over the N_p labeled pixels.
9. The method according to claim 1, wherein in (3c) the back-propagation gradient descent algorithm is used to update the convolution kernel weight parameters ω_t and convolution kernel bias parameters b_t of the current network S_t according to:

ω_(t+1) = ω_t − η · ∂L_i/∂ω_t
b_(t+1) = b_t − η · ∂L_i/∂b_t

where L_i is the error between the predicted disparity map I_i^pred and the true disparity map I_i^GT, η represents the network update learning rate, 0.0001 ≤ η ≤ 0.001, and ω_(t+1) and b_(t+1) respectively represent the updated convolution kernel weight parameter and the updated convolution kernel bias parameter.
CN202210336322.8A 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid Pending CN115375746A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210336322.8A CN115375746A (en) 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210336322.8A CN115375746A (en) 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid

Publications (1)

Publication Number Publication Date
CN115375746A true CN115375746A (en) 2022-11-22

Family

ID=84060238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210336322.8A Pending CN115375746A (en) 2022-03-31 2022-03-31 Stereo matching method based on double-space pooling pyramid

Country Status (1)

Country Link
CN (1) CN115375746A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475182A (en) * 2023-09-13 2024-01-30 江南大学 Stereo matching method based on multi-feature aggregation
CN117475182B (en) * 2023-09-13 2024-06-04 江南大学 Stereo matching method based on multi-feature aggregation


Similar Documents

Publication Publication Date Title
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
Mehta et al. Structured adversarial training for unsupervised monocular depth estimation
CN103310421B (en) The quick stereo matching process right for high-definition image and disparity map acquisition methods
CN110197505B (en) Remote sensing image binocular stereo matching method based on depth network and semantic information
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN111046767B (en) 3D target detection method based on monocular image
CN111127522B (en) Depth optical flow prediction method, device, equipment and medium based on monocular camera
Li et al. ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion
CN111583313A (en) Improved binocular stereo matching method based on PSmNet
Chen et al. Depth completion using geometry-aware embedding
CN111462211B (en) Binocular parallax calculation method based on convolutional neural network
Wei et al. Bidirectional hybrid lstm based recurrent neural network for multi-view stereo
CN112862949A (en) Object 3D shape reconstruction method based on multiple views
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN114463492A (en) Adaptive channel attention three-dimensional reconstruction method based on deep learning
Zhang et al. Pa-mvsnet: Sparse-to-dense multi-view stereo with pyramid attention
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
CN113780389A (en) Deep learning semi-supervised dense matching method and system based on consistency constraint
WO2022120988A1 (en) Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution
CN111368882B (en) Stereo matching method based on simplified independent component analysis and local similarity
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN104616304A (en) Self-adapting support weight stereo matching method based on field programmable gate array (FPGA)
CN116778091A (en) Deep learning multi-view three-dimensional reconstruction algorithm based on path aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination