CN114972822A - End-to-end binocular stereo matching method based on convolutional neural network - Google Patents

End-to-end binocular stereo matching method based on convolutional neural network

Info

Publication number
CN114972822A
CN114972822A (application CN202210659456.3A)
Authority
CN
China
Prior art keywords
stereo matching
feature
layer
module
convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210659456.3A
Other languages
Chinese (zh)
Inventor
刘杰
高晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202210659456.3A priority Critical patent/CN114972822A/en
Publication of CN114972822A publication Critical patent/CN114972822A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end binocular stereo matching method based on a convolutional neural network, which improves the existing PSMNet model for disparity estimation. First, dilated (atrous) convolution is added to the multi-scale spatial pyramid pooling layer to enlarge the receptive field of the network; second, in the cost aggregation module, four encoder-decoder modules are stacked in series to further extract high-level information. The improved network model enlarges the receptive field while retaining the strengths of the original model, captures richer detail information, and alleviates the problem that occluded and weakly textured regions cannot be matched correctly.

Description

End-to-end binocular stereo matching method based on convolutional neural network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an end-to-end binocular stereo matching method based on a convolutional neural network.
Background
The purpose of stereo matching is to obtain the disparity values of image points in a stereo image pair, from which depth information can be computed; this is at the core of many computer vision applications, such as autonomous driving, robot navigation, binocular ranging, and three-dimensional reconstruction. Stereo matching methods can be divided into traditional methods and deep learning methods, and deep learning methods can be further divided into non-end-to-end and end-to-end methods. Traditional stereo matching algorithms have low accuracy and low processing speed, which greatly limits their application in real scenes. In recent years, with the development of massively parallel computing devices, methods based on deep learning have made breakthrough progress in numerous vision tasks. Convolutional neural networks offer high processing speed and strong robustness, which fits the requirements of stereo matching well, so they have gradually become the mainstream research direction for stereo matching algorithms. Compared with non-end-to-end algorithms, end-to-end algorithms allow the whole pipeline to be optimized jointly and are therefore more convenient in practical applications. However, end-to-end binocular stereo matching still faces many problems that affect its accuracy and speed, such as weak texture, repeated texture, and low matching rates at object edges.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an end-to-end binocular stereo matching method based on a convolutional neural network, which improves the existing PSMNet model for disparity estimation. First, dilated convolution is added to the multi-scale spatial pyramid pooling layer to enlarge the receptive field of the network; second, in the cost aggregation module, four encoder-decoder modules are stacked in series to further extract high-level information. The improved network model enlarges the receptive field while retaining the strengths of the original model, captures richer detail information, and alleviates the problem that occluded and weakly textured regions cannot be matched correctly. The method is realized through the following steps:
(1) collecting a data set and preprocessing the data set;
(1-1) collecting a data set: the data set is derived from two open source data sets, SceneFlow and KITTI 2015, the former including a training set and a validation set, the latter including a training set and a test set;
(1-2) preprocessing: randomly cropping each input left and right view in the data set to 256 × 512, and then normalizing the cropped views;
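By way of illustration, a minimal PyTorch-style sketch of this preprocessing step is given below. The normalization statistics are an assumption of the sketch (the text only states that a normalization operation is applied), and the helper name preprocess_pair is introduced here for illustration only.

```python
import random
import numpy as np
import torch

CROP_H, CROP_W = 256, 512
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed statistics
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)   # assumed statistics

def preprocess_pair(left, right):
    """left/right: H x W x 3 uint8 arrays; returns normalized C x H x W tensors."""
    h, w, _ = left.shape
    y = random.randint(0, h - CROP_H)   # random crop origin shared by both views
    x = random.randint(0, w - CROP_W)
    out = []
    for img in (left, right):
        crop = img[y:y + CROP_H, x:x + CROP_W].astype(np.float32) / 255.0
        crop = (crop - MEAN) / STD      # per-channel normalization
        out.append(torch.from_numpy(crop).permute(2, 0, 1))
    return out[0], out[1]
```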
(2) constructing a stereo matching network, wherein the stereo matching network comprises a feature extraction module, a feature fusion module, a cost volume construction module, a cost aggregation module and a disparity regression module;
(2-1) constructing the feature extraction module: the feature extraction module is a weight-sharing twin (Siamese) network used to extract features from the input left and right views to be matched, and its output is two unary feature maps. The twin network first downsamples the input left and right views once using 3 convolutional layers, where the convolution kernel of each convolutional layer is 3 × 3 and the stride is 2; next, 4 residual layers further process the features, where the first residual layer comprises 3 residual blocks, the second comprises 16, the third comprises 3, and the fourth comprises 3. The convolution kernels of the four residual layers are all 3 × 3 and the feature dimensions are all 32; the stride of one residual block in the second residual layer is 2, and the strides of the remaining residual blocks are all 1. Each residual block has the structure BN-conv-BN-ReLU-conv-BN, where BN, conv and ReLU denote batch normalization, a convolutional layer and a rectified linear unit, respectively. After the convolution operations, the output of the twin network is two unary features of size H/4 × W/4 × F, where H, W denote the height and width of the original input image and F denotes the feature dimension;
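The residual block layout stated above (BN-conv-BN-ReLU-conv-BN with 3 × 3 kernels and 32 channels) can be sketched as follows. The strided 1 × 1 projection on the identity path is an assumption needed to make the shapes match when a block uses stride 2; it is not spelled out in the text.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block following the stated BN-conv-BN-ReLU-conv-BN layout
    (3 x 3 kernels, 32 channels); the identity projection is an assumption."""
    def __init__(self, channels=32, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # When stride != 1 the identity path needs a matching resolution; a strided
        # 1 x 1 projection is a standard choice (assumed, not stated in the text).
        self.down = None
        if stride != 1:
            self.down = nn.Conv2d(channels, channels, 1, stride=stride, bias=False)

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        return self.body(x) + identity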
(2-2) constructing the feature fusion module: the feature fusion module performs a multi-scale pooling operation on the features obtained in the previous step, with each level using dilated convolution, and then performs feature fusion: a 1 × 1 convolution fuses the four-scale features obtained from the pooling with the outputs of the second and fourth residual layers. The output is two unary features of size H/4 × W/4 × F, where H, W denote the height and width of the original input image and F denotes the feature dimension;
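A sketch of this multi-scale dilated-convolution branch and 1 × 1 fusion is shown below, using the dilation rates 6, 12, 18 and 24 given later in the specific embodiment; the channel counts and the class name DilatedSPP are illustrative assumptions rather than values fixed by the text.

```python
import torch
import torch.nn as nn

class DilatedSPP(nn.Module):
    """Four parallel 3 x 3 dilated convolutions (rates 6/12/18/24) whose outputs
    are fused with the skip features by a 1 x 1 convolution. Channel counts are
    illustrative; `skip` stands for the 2nd/4th residual-layer outputs."""
    def __init__(self, in_ch=32, branch_ch=32, skip_ch=64, out_ch=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True))
            for r in (6, 12, 18, 24)
        ])
        # 1 x 1 fusion of the four branches plus the skip features
        self.fuse = nn.Conv2d(4 * branch_ch + skip_ch, out_ch, 1, bias=False)

    def forward(self, x, skip):
        feats = [b(x) for b in self.branches] + [skip]
        return self.fuse(torch.cat(feats, dim=1))
```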
(2-3) constructing the cost volume construction module: this module computes the matching cost between the two feature maps; its input is the two feature maps containing context information, and its output is a four-dimensional tensor. The specific computation is as follows: at each possible disparity, the reference feature map containing context information is concatenated with the corresponding target feature map containing context information, and the results are packed into a 4-dimensional cost volume. The dimensions of the cost volume output by this module are H/4 × W/4 × D/4 × F, where H, W denote the height and width of the original input image, D denotes the maximum possible disparity value, and F denotes the feature dimension;
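A sketch of this concatenation-based cost volume construction follows. Note that concatenating the two feature maps doubles the channel count per disparity level (64 channels in the embodiment, where each input map has 32 channels); the function name is introduced here for illustration.

```python
import torch

def build_cost_volume(ref_feat, tgt_feat, max_disp_over_4):
    """For each candidate disparity d, pair the reference features with the
    target features shifted by d; output shape is B x 2F x D/4 x H/4 x W/4."""
    b, f, h, w = ref_feat.shape
    cost = ref_feat.new_zeros(b, 2 * f, max_disp_over_4, h, w)
    for d in range(max_disp_over_4):
        if d == 0:
            cost[:, :f, d] = ref_feat
            cost[:, f:, d] = tgt_feat
        else:
            # only columns with a valid correspondence are filled at disparity d
            cost[:, :f, d, :, d:] = ref_feat[:, :, :, d:]
            cost[:, f:, d, :, d:] = tgt_feat[:, :, :, :-d]
    return cost
```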
(2-4) constructing the cost aggregation module: the cost aggregation module is built from encoder-decoder structures, four in total, and learns a regularization function over the cost volume to perform cost aggregation; its input is the cost volume and its output is a regularized feature map. First, the cost volume is convolved with 2 3D convolutional layers, each using 2 convolution kernels of size 3 × 3 with a feature dimension of 32, and the output of the 1st 3D convolutional layer is added to the output of the 2nd 3D convolutional layer. The four encoder-decoder structures are then stacked in series; each consists of an encoding stage and a decoding stage, and the encoding stage contains 4 3D convolutional layers. The decoding stage applies only two 3D deconvolution layers for upsampling, and for the first deconvolution layer a feature map of the corresponding dimension is added from the encoding stage so as to retain coarse high-level information and detailed low-level information. Finally, two 3D convolutional layers further reduce the feature dimension to obtain the regularized feature map, whose dimensions are H/4 × W/4 × D/4 × 1, where H, W denote the height and width of the original input image and D denotes the maximum possible disparity value;
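One encoder-decoder block of this aggregation stage can be sketched as follows, using the kernel sizes and strides given later in the specific embodiment (4 encoding 3D convolutions with the 1st and 3rd at stride 2, two stride-2 3D deconvolutions, and an additive skip from the encoder for the first deconvolution). The internal channel widths are assumptions, since the text only fixes the 32-channel input.

```python
import torch
import torch.nn as nn

def conv3d_bn(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True))

class Hourglass(nn.Module):
    """One encoder-decoder block: four 3D convolutions (1st and 3rd strided) and
    two stride-2 3D deconvolutions, with an additive skip from the encoder."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = conv3d_bn(ch, 2 * ch, stride=2)
        self.enc2 = conv3d_bn(2 * ch, 2 * ch, stride=1)
        self.enc3 = conv3d_bn(2 * ch, 2 * ch, stride=2)
        self.enc4 = conv3d_bn(2 * ch, 2 * ch, stride=1)
        self.dec1 = nn.Sequential(
            nn.ConvTranspose3d(2 * ch, 2 * ch, 3, stride=2, padding=1,
                               output_padding=1, bias=False),
            nn.BatchNorm3d(2 * ch))
        self.dec2 = nn.Sequential(
            nn.ConvTranspose3d(2 * ch, ch, 3, stride=2, padding=1,
                               output_padding=1, bias=False),
            nn.BatchNorm3d(ch))

    def forward(self, x):
        e2 = self.enc2(self.enc1(x))          # 1/2 resolution of the cost volume
        e4 = self.enc4(self.enc3(e2))         # 1/4 resolution
        d1 = torch.relu(self.dec1(e4) + e2)   # additive skip from the encoder
        return self.dec2(d1)                  # back to the input resolution
```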
(2-5) constructing the disparity regression module: the disparity regression module negates the values of the matching cost volume and converts them into the corresponding matching probabilities using a softmax function; its input is the regularized feature map, and its output is a disparity map of dimensions H × W, where H, W denote the height and width of the original input image, respectively;
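The regression itself can be sketched as a soft-argmin: the cost is negated, a softmax over the disparity dimension yields matching probabilities, and the predicted disparity is the probability-weighted sum. The weighted-sum step is implied by the word "regression" but not spelled out in the text, so it is an assumption of the sketch, as is the requirement that the cost has already been upsampled to full resolution.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost, max_disp):
    """Soft-argmin regression; `cost` is assumed to have shape B x D x H x W,
    i.e. upsampled to full resolution with the channel dimension removed."""
    prob = F.softmax(-cost, dim=1)                       # matching probabilities
    disp_values = torch.arange(max_disp, dtype=prob.dtype,
                               device=prob.device).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp_values, dim=1)          # B x H x W disparity map
```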
(3) training a model;
(3-1) determining parameter settings of the network model;
the parameter settings of the network model include selecting Adam as the optimizer, setting the learning rate to 1e-4, and setting the maximum number of training epochs to 10;
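These settings correspond to the following minimal sketch; the helper name make_optimizer is illustrative.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Optimizer setup as stated above: Adam with a learning rate of 1e-4."""
    return torch.optim.Adam(model.parameters(), lr=1e-4)

MAX_EPOCHS = 10  # maximum number of training epochs stated above
```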
(3-2) sending the preprocessed left and right views into model training;
First, the preprocessed left and right views of the SceneFlow training set are input into the stereo matching network model and forward propagation is performed to obtain the final disparity map; the output final disparity map and the real disparity map are then input into the loss function, and back propagation is performed using the batch gradient descent method to obtain a pre-trained model; the pre-trained model is then further trained on the preprocessed KITTI training set data until the loss function converges, yielding the final stereo matching network model;
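A sketch of one epoch of this training procedure is given below. The loss function is not given in closed form here, so a smooth-L1 loss over pixels with valid ground truth is assumed (the usual choice for this family of networks), as is the maximum disparity of 192 (consistent with D/4 = 48 in the specific embodiment) and the model/loader interfaces.

```python
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, max_disp=192):
    """One epoch of the training procedure described above.

    Assumptions: `model(left, right)` returns a B x H x W disparity map, `loader`
    yields (left, right, disparity) batches, and the loss is a smooth-L1 loss over
    pixels whose ground-truth disparity lies in (0, max_disp).
    """
    model.train()
    for left, right, gt_disp in loader:
        optimizer.zero_grad()
        pred = model(left, right)                       # forward propagation
        mask = (gt_disp > 0) & (gt_disp < max_disp)     # ignore invalid pixels
        loss = F.smooth_l1_loss(pred[mask], gt_disp[mask])
        loss.backward()                                 # back propagation
        optimizer.step()
```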
(4) and carrying out binocular stereo matching by using the trained stereo matching network model.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an overall algorithm of an end-to-end binocular stereo matching method based on a convolutional neural network according to the present invention;
FIG. 2 is a network structure diagram of the end-to-end binocular stereo matching method based on a convolutional neural network according to the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the parts closely related to the scheme according to the present invention are shown in the drawings, and other details not closely related to the present invention are omitted.
The first embodiment is as follows:
in this embodiment, an end-to-end binocular stereo matching method based on a convolutional neural network is described with reference to fig. 1, and the method includes the following steps:
step one, collecting a data set and preprocessing: the data set is derived from two open source data sets, namely SceneFlow and KITTI 2015, where the former comprises a training set and a validation set and the latter comprises a training set and a test set, and network training is performed under the PyTorch framework;
step two, constructing a stereo matching network: the stereo matching network comprises a feature extraction module, a feature fusion module, a cost volume construction module, a cost aggregation module and a disparity regression module;
step three, model training: inputting the left and right views of the preprocessed training data set into a model of a stereo matching network for forward propagation calculation to obtain a final disparity map; then, inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method until the model converges;
and step four, performing binocular stereo matching by using the trained stereo matching network model.
The second embodiment is as follows:
On the basis of the first specific embodiment and with reference to FIG. 2, this embodiment details the construction of the feature extraction module, the feature fusion module, the cost volume construction module, the cost aggregation module and the disparity regression module of the stereo matching network in step two of the end-to-end binocular stereo matching method based on a convolutional neural network, as follows:
The feature extraction module is a weight-sharing twin (Siamese) network used to extract features from the input left and right views to be matched, and its output is two unary feature maps. The twin network first downsamples the input left and right views once using 3 convolutional layers, where the convolution kernel of each convolutional layer is 3 × 3 and the stride is 2; next, 4 residual layers further process the features, where the first residual layer comprises 3 residual blocks, the second comprises 16, the third comprises 3, and the fourth comprises 3. The convolution kernels of the four residual layers are all 3 × 3 and the feature dimensions are all 32; the stride of one residual block in the second residual layer is 2, and the strides of the remaining residual blocks are all 1. Each residual block has the structure BN-conv-BN-ReLU-conv-BN, where BN, conv and ReLU denote batch normalization, a convolutional layer and a rectified linear unit, respectively. After the convolution operations, the output of the twin network is two unary features of size 64 × 128 × 128, where 64, 128 and 128 denote the feature height, width and feature dimension in turn;
The feature fusion module performs a multi-scale pooling operation on the obtained features, with each level using dilated convolution with dilation rates of 6, 12, 18 and 24 respectively and a stride of 1; feature fusion is then performed, with a 1 × 1 convolution fusing the four-scale features obtained from the pooling with the outputs of the second and fourth residual layers. The output is two unary features of size 64 × 128 × 32, where 64, 128 and 32 denote the feature height, width and feature dimension in turn;
The cost volume construction module computes the matching cost between the two feature maps; its input is the two feature maps containing context information, and its output is a four-dimensional tensor. The specific computation is as follows: at each possible disparity, the reference feature map containing context information is concatenated with the corresponding target feature map containing context information, and the results are packed into a 4-dimensional cost volume. The dimensions of the cost volume output by this module are 64 × 128 × 48 × 64, where 64, 128, 48 and 64 denote the feature height and width, the maximum possible disparity value and the feature dimension in turn;
The cost aggregation module is built from encoder-decoder structures and learns a regularization function over the cost volume to perform cost aggregation; its input is the cost volume and its output is a regularized feature map. First, the cost volume is convolved with 2 3D convolutional layers, each using 2 convolution kernels of size 3 × 3 with a feature dimension of 32, and the output of the 1st 3D convolutional layer is added to the output of the 2nd 3D convolutional layer. Four encoder-decoder structures are then stacked in series; each consists of an encoding stage and a decoding stage. The encoding stage contains 4 3D convolutional layers with 3 × 3 × 3 kernels, where the strides of the first and third convolutional layers are 2 and the remaining strides are 1. The decoding stage applies only two 3D deconvolution layers for upsampling, with 3 × 3 × 3 kernels and a stride of 2; for the first deconvolution layer, a feature map of the corresponding dimension is added from the encoding stage so as to retain coarse high-level information and detailed low-level information. The output of each encoder-decoder module is added to the result of passing the cost volume through the two convolutions, both with 3 × 3 × 3 kernels and a stride of 1. Finally, two 3D convolutional layers further reduce the feature dimension to obtain the regularized feature map, whose dimensions are 64 × 128 × 48 × 1, where 64, 128, 48 and 1 denote the feature height and width, the maximum possible disparity value and the feature dimension in turn;
The disparity regression module negates the values of the matching cost volume and converts them into the corresponding matching probabilities using a softmax function; its input is the regularized feature map, and its output is a disparity map of dimensions H × W, where H, W denote the height and width, respectively, of the original input image.
The above description is only one embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention, which shall therefore be subject to the protection scope of the claims.

Claims (6)

1. An end-to-end binocular stereo matching method based on a convolutional neural network is characterized by comprising the following steps:
step 1: collecting and preprocessing a data set, wherein the data set is derived from two open source data sets, namely SceneFlow and KITTI 2015, the former comprising a training set and a validation set and the latter comprising a training set and a test set, and network training is performed under the PyTorch framework;
step 2: constructing a stereo matching network, wherein the stereo matching network comprises a feature extraction module, a feature fusion module, a cost volume construction module, a cost aggregation module and a disparity regression module;
and step 3: model training, namely inputting the left view and the right view of the preprocessed training data set into a model of a stereo matching network for forward propagation calculation to obtain a final disparity map; then, inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method until the model converges;
and 4, step 4: and carrying out binocular stereo matching by using the trained stereo matching network model.
2. The convolutional neural network-based end-to-end binocular stereo matching method as claimed in claim 1, wherein the preprocessing in step 1 is implemented by the following steps:
(1) randomly cropping each input left view and right view in the data set to 256 × 512;
(2) and carrying out normalization operation on the cut picture.
3. The end-to-end binocular stereo matching method based on the convolutional neural network as claimed in claim 1, wherein the step 2 of constructing the stereo matching network is implemented by the following steps:
(1) a feature extraction module:
(1-1) downsampling the input left and right views once using 3 convolutional layers, each with a 3 × 3 convolution kernel and a stride of 2;
(1-2) further processing the input left and right views with 4 residual layers, wherein the first residual layer includes 3 residual blocks, the second includes 16, the third includes 3, and the fourth includes 3; the convolution kernels of all residual layers are 3 × 3 and the feature dimensions are all 32; the stride of one residual block in the second residual layer is 2, and the strides of the remaining residual blocks are 1; after the convolution operations, the output is two unary features of size H/4 × W/4 × F, where H, W respectively denote the height and width of the original input image and F denotes the feature dimension;
(2) a feature fusion module:
(2-1) performing a multi-scale pooling operation on the obtained features, with each level using dilated convolution;
(2-2) performing feature fusion by using a 1 × 1 convolution to fuse the multi-scale features obtained from the pooling with the outputs of the second and fourth residual layers, the output being two unary features of size H/4 × W/4 × F, where H, W respectively denote the height and width of the original input image and F denotes the feature dimension;
(3) a cost volume construction module:
(3-1) concatenating the reference feature map containing context information with the corresponding target feature map containing context information at each possible disparity;
(3-2) packing the resulting feature maps into a 4-dimensional cost volume, the dimensions of the cost volume output by the module being H/4 × W/4 × D/4 × F, where H, W respectively denote the height and width of the original input image, D denotes the maximum possible disparity value, and F denotes the feature dimension;
(4) a cost aggregation module:
(4-1) convolving the obtained cost volume with 2 3D convolutional layers, each using 2 convolution kernels of size 3 × 3 with a feature dimension of 32, the output of the 1st 3D convolutional layer being added to the output of the 2nd 3D convolutional layer;
(4-2) stacking four encoder-decoder structures in series, each comprising an encoding stage and a decoding stage, the encoding stage containing 4 3D convolutional layers; the decoding stage applies only two 3D deconvolution layers for upsampling, and for the first deconvolution layer a feature map of the corresponding dimension is added from the encoding stage so as to retain coarse high-level information and detailed low-level information;
(4-3) for each encoder-decoder structure, further reducing the feature dimension with two 3D convolutional layers to obtain a regularized feature map with dimensions H/4 × W/4 × D/4 × 1, where H, W respectively denote the height and width of the original input image and D denotes the maximum possible disparity value;
(5) a disparity regression module:
(5-1) negating the values of the matching cost volume;
(5-2) converting the matching cost volume into the corresponding matching probabilities by using a softmax function; its input is the regularized feature map and the output is a disparity map with dimensions H × W, where H, W denote the height and width, respectively, of the original input image.
4. The convolutional neural network-based end-to-end binocular stereo matching method as claimed in claim 1, wherein the model training in step 3 is implemented by the following steps:
(1) firstly, inputting the left and right views of a preprocessed training data set into a model of a stereo matching network for forward propagation calculation to obtain a final disparity map;
(2) inputting the output final disparity map and the real disparity map into a loss function, and performing back propagation by using a batch gradient descent method until the loss function converges, to obtain a pre-trained model;
(3) training the pre-trained model on the preprocessed KITTI training set data until the loss function converges, to obtain the final stereo matching network model.
5. The end-to-end binocular stereo matching method based on the convolutional neural network as claimed in claim 3, wherein the multi-scale pooling operations in the feature fusion module all use dilated convolution for downsampling, with dilation rates of 6, 12, 18 and 24 respectively and strides of 1.
6. The convolutional neural network-based end-to-end binocular stereo matching method of claim 3, wherein the cost aggregation module has four encoder-decoder structures, the encoding stage of each employing 4 convolutional layers with 3 × 3 × 3 convolution kernels, wherein the strides of the first and third convolutional layers are 2 and the remaining strides are 1; the decoding stage applies 2 deconvolution layers with 3 × 3 × 3 convolution kernels, each with a stride of 2.
CN202210659456.3A 2022-06-10 2022-06-10 End-to-end binocular stereo matching method based on convolutional neural network Pending CN114972822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210659456.3A CN114972822A (en) 2022-06-10 2022-06-10 End-to-end binocular stereo matching method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210659456.3A CN114972822A (en) 2022-06-10 2022-06-10 End-to-end binocular stereo matching method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN114972822A true CN114972822A (en) 2022-08-30

Family

ID=82960904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210659456.3A Pending CN114972822A (en) 2022-06-10 2022-06-10 End-to-end binocular stereo matching method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114972822A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375930A (en) * 2022-10-26 2022-11-22 中国航发四川燃气涡轮研究院 Stereo matching network and stereo matching method based on multi-scale information
CN115375930B (en) * 2022-10-26 2023-05-05 中国航发四川燃气涡轮研究院 Three-dimensional matching network and three-dimensional matching method based on multi-scale information

Similar Documents

Publication Publication Date Title
US11521039B2 (en) Method and apparatus with neural network performing convolution
CN110533712A (en) A kind of binocular solid matching process based on convolutional neural networks
CN110021069A (en) A kind of method for reconstructing three-dimensional model based on grid deformation
CN112927357A (en) 3D object reconstruction method based on dynamic graph network
CN108647723B (en) Image classification method based on deep learning network
CN112348870B (en) Significance target detection method based on residual error fusion
CN113554084B (en) Vehicle re-identification model compression method and system based on pruning and light convolution
CN112541572A (en) Residual oil distribution prediction method based on convolutional encoder-decoder network
CN113112607B (en) Method and device for generating three-dimensional grid model sequence with any frame rate
CN113514877A (en) Self-adaptive quick earthquake magnitude estimation method
CN112509021A (en) Parallax optimization method based on attention mechanism
CN114972822A (en) End-to-end binocular stereo matching method based on convolutional neural network
CN109948575A (en) Eyeball dividing method in ultrasound image
CN111783862A (en) Three-dimensional significant object detection technology of multi-attention-directed neural network
CN112529068A (en) Multi-view image classification method, system, computer equipment and storage medium
Kakillioglu et al. 3D capsule networks for object classification with weight pruning
CN113642675B (en) Underground rock stratum distribution imaging acquisition method, system, terminal and readable storage medium based on full waveform inversion and convolutional neural network
CN114549757A (en) Three-dimensional point cloud up-sampling method based on attention mechanism
WO2022213395A1 (en) Light-weighted target detection method and device, and storage medium
CN111612046B (en) Feature pyramid graph convolution neural network and application thereof in 3D point cloud classification
CN117011943A (en) Multi-scale self-attention mechanism-based decoupled 3D network action recognition method
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN116597071A (en) Defect point cloud data reconstruction method based on K-nearest neighbor point sampling capable of learning
CN115578561A (en) Real-time semantic segmentation method and device based on multi-scale context aggregation network
Liu et al. A deep neural network pruning method based on gradient L1-norm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination