CN111951319A - Image stereo matching method - Google Patents


Info

Publication number
CN111951319A
Authority
CN
China
Prior art keywords
image
feature
parallax
stereo matching
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010847540.9A
Other languages
Chinese (zh)
Inventor
周杰
李永强
郭振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202010847540.9A priority Critical patent/CN111951319A/en
Publication of CN111951319A publication Critical patent/CN111951319A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application belongs to the technical field of image processing, and particularly relates to an image stereo matching method. Existing stereo matching methods based on deep learning generalize poorly. One main reason is that data in the stereo matching field suffer from defects such as strong reflections and occlusions; on the other hand, images also contain texture-less regions, so methods that add neighborhood constraints are prone to overfitting. Meanwhile, high-quality data sets for stereo matching are relatively scarce, and a network with generalization capability is difficult to obtain through simple training. The application provides an image stereo matching method comprising the following steps: constructing a training image library; enhancing the training images; extracting point (line) features for all images; training a feature point extraction network; extracting unary features of the binocular images; obtaining a coarse disparity map with an iteration module; aggregating the unary features; performing disparity regression; and refining the disparity. The method addresses the overfitting problem in deep stereo matching algorithms.

Description

Image stereo matching method
Technical Field
The application belongs to the technical field of image processing, and particularly relates to an image stereo matching method.
Background
The stereo matching can be applied to scenes such as automatic driving and three-dimensional reconstruction. In the stereo matching task, occlusion, specular reflection and lack of texture strongly affect the matching result (for example, as shown in FIG. 1, the car glass in an outdoor road scene is reflective). Ideally, rectified binocular images are aligned in the vertical direction and differ only by a horizontal disparity. In practice, however, camera rectification is imperfect and the parameters of the two cameras are not fully consistent; moreover, the scenes seen from the two viewpoints differ in content and illumination. The images captured by the binocular cameras therefore show some discrepancies, which poses challenges for robust stereo matching.
Existing schemes that perform stereo matching with a deep network generally extract a feature for each pixel or pixel region in one image, compute its similarity with all pixels or pixel regions along the same horizontal line in the other image, and produce a disparity estimate end to end. Some methods also borrow ideas from traditional stereo matching and apply neighborhood constraints to the matching of pixels or pixel regions; typical examples add edge constraints, neighborhood constraints and the like.
Existing deep-learning stereo matching methods generalize poorly. One main reason is that data in the stereo matching field suffer from defects such as strong reflections and occlusions; on the other hand, images also contain texture-less regions, so methods that add neighborhood constraints are prone to overfitting. Meanwhile, high-quality stereo matching data sets are relatively scarce, and a network with generalization capability is difficult to obtain through simple training.
Disclosure of Invention
1. Technical problem to be solved
Existing deep-learning stereo matching methods generalize poorly. One main reason is that data in the stereo matching field suffer from defects such as strong reflections and occlusions; on the other hand, images also contain texture-less regions, so methods that add neighborhood constraints are prone to overfitting. Meanwhile, high-quality stereo matching data sets are relatively scarce, and a network with generalization capability is difficult to obtain through simple training.
2. Technical scheme
In order to achieve the above object, the present application provides an image stereo matching method, including the steps of:
step 1: constructing a training image library;
step 2: enhancing a training image;
step 3: extracting point (line) features for all images;
step 4: training a feature point extraction network;
step 5: extracting unary features of the binocular images;
step 6: obtaining a coarse disparity map with an iteration module;
step 7: aggregating the unary features;
step 8: performing disparity regression;
step 9: refining the disparity.
Another embodiment provided by the present application is: the image library in step 1 comprises binocular data and monocular data. Each binocular sample comprises a left image and a right image; the data sets used include the Sceneflow synthetic data set, the Kitti autonomous driving data set, the Middlebury data set, the Eth3d data set and other binocular data sets. The monocular data sets used include general image data sets (treated as monocular images) such as Imagenet and Coco.
Another embodiment provided by the present application is: the enhancing in step 2 comprises applying random brightness variations, occlusions or random offsets in the vertical direction.
Another embodiment provided by the present application is: the point features extracted in step 3 use traditional feature detection operators such as ORB and SIFT. Considering that point features perform poorly in texture-less regions, line feature detection is added, and the regions obtained by line detection are included in the set of feature points; the extracted feature points are then used to train the feature point extraction network.
Another embodiment provided by the present application is: the feature point extraction module in step 4 comprises a series of 2D convolution layers and pooling layers, and the labels used for training are the feature points extracted by traditional operators, including the points obtained by the point- and line-feature methods. Part of the feature point extraction module may also directly use a traditional operator, in which case the dimensions of the unary feature descriptors produced by the two approaches must be kept consistent.
Another embodiment provided by the present application is: the features generated by the feature point extraction module in step 4 are divided into two parts: one part serves as the hidden vector h of the iteration module, and the other part is fed directly into the iteration module as input.
Another embodiment provided by the present application is: the unary feature extraction module in step 5 is based on a resnet structure. A unary feature is a feature vector describing a single pixel: stereo matching operates per pixel, and the goal is to find, for every pixel in the left image, the corresponding pixel in the right image.
Another embodiment provided by the present application is: in the iteration module of step 6, L denotes a Lookup operation on the four-dimensional matching cost: a search radius r is defined and used as an index to retrieve an r-dimensional slice from the matching cost. The matching cost, the input disparity value and the hidden vector h are fed into the iteration module, which outputs a disparity increment Δd and an updated hidden vector h'; the initial disparity value is set to 0.
Another embodiment provided by the present application is: the deformable convolution module in step 7 aggregates the unary features. Feature aggregation binds the disparity value of a single pixel, to some extent, to the disparity values of other pixels in the image, which prevents erroneous results in abnormal regions and improves the robustness of the system.
Another embodiment provided by the present application is: the disparity regression in step 8 computes, over the disparity range, the weighted sum of each disparity value d_i with its corresponding weight. The disparity values d_i are the positive integers from d_min to d_max; the matching costs within this range are converted to probability values, and the weighted sum yields the corresponding disparity value.
Another embodiment provided by the present application is: the disparity refinement in step 9 further refines the computed disparity values using the input image. In particular, when speed is a priority, the network may output a low-resolution disparity map; the refinement module then up-samples it, guided by the original input image, to obtain a disparity map whose resolution matches the original input.
3. Advantageous effects
Compared with the prior art, the image stereo matching method provided by the application has the beneficial effects that:
the image stereo matching method provided by the application is a binocular image stereo matching method with high robustness and high speed.
The image stereo matching method provided by the application addresses the overfitting problem in deep stereo matching algorithms.
Depth information has wide application prospects in various fields, so the image stereo matching method provided by the application has strong theoretical significance and practical value.
With the image stereo matching method of the application, a large-scale training image database is built, real and synthetic data are used for joint training, a traditional feature detection algorithm is incorporated, iterative updating is carried out, and deformable convolution replaces the 3D convolution module, so that a disparity map is obtained quickly and robustly.
With the image stereo matching method of the application, a depth map can be obtained when the camera parameters are known, and such depth information has wide application prospects in autonomous driving, navigation and the like.
Drawings
FIG. 1 is an exemplary schematic diagram of a Kitti dataset image of the present application;
FIG. 2 is an exemplary schematic diagram of a Sceneflow dataset image of the present application;
FIG. 3 is a schematic diagram illustrating an example of a Middlebury dataset image of the present application;
FIG. 4 is a diagram illustrating the results of ORB feature point detection in the present application;
FIG. 5 is a schematic diagram of line feature detection results of the present application;
fig. 6 is a schematic network structure diagram of the image stereo matching method of the present application;
FIG. 7 is a schematic view of the parallax resulting from FIG. 1 of the present application;
fig. 8 is a flowchart illustrating an image stereo matching method according to the present application.
Detailed Description
Hereinafter, specific embodiments of the present application are described in detail with reference to the accompanying drawings, so that those skilled in the art can practice the application from this description. Without departing from the principles of the present application, features from different embodiments may be combined to yield new embodiments, or certain features may be substituted to yield further preferred embodiments.
ORB (Oriented FAST and Rotated BRIEF) is an algorithm for fast feature point extraction and description. It was published by Ethan Rublee, Vincent Rabaud, Kurt Konolige and Gary R. Bradski in 2011 in the paper "ORB: An Efficient Alternative to SIFT or SURF" (www.willowgarage.com/sites/default/files/orb_final.pdf). The ORB algorithm has two parts: feature point extraction and feature point description. Feature extraction builds on the FAST (Features from Accelerated Segment Test) algorithm, and feature description improves on the BRIEF (Binary Robust Independent Elementary Features) descriptor. ORB thus combines the FAST keypoint detector with the BRIEF descriptor and improves and optimizes both. The ORB algorithm is notable for its computation speed. This is owed first to the use of FAST for detecting feature points, which is famously fast, and second to the BRIEF descriptor, whose binary-string representation both saves storage space and greatly shortens matching time.
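As an illustration of the above (a minimal sketch, not part of the patent; the file names are placeholders), ORB keypoints can be detected and matched with OpenCV as follows:

```python
# Minimal ORB demo: detect keypoints, compute binary descriptors, and
# match them with Hamming distance. Illustrative only; "left.png" and
# "right.png" are placeholder paths.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)           # FAST keypoints + rotated BRIEF
kp_l, des_l = orb.detectAndCompute(left, None)
kp_r, des_r = orb.detectAndCompute(right, None)

# BRIEF descriptors are binary strings, so Hamming distance is the natural
# metric and brute-force matching stays fast.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_l, des_r), key=lambda m: m.distance)
print(f"{len(matches)} ORB matches")
```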
Referring to fig. 1 to 8, the present application provides an image stereo matching method, including the following steps:
step 1: constructing a training image library;
step 2: enhancing each training image;
step 3: training a point (line) feature extraction network on single images;
step 4: extracting feature points and training the network;
step 5: extracting unary features and constructing the four-dimensional matching cost;
step 6: obtaining a coarse disparity map with the iterative network module;
step 7: aggregating the unary features with the deformable convolution module;
step 8: performing disparity regression;
step 9: refining the disparity.
Further, the image library in step 1 comprises binocular data and monocular data. Each binocular sample comprises a left image and a right image; the data sets used include the Sceneflow synthetic data set, the Kitti autonomous driving data set, the Middlebury data set, the Eth3d data set and other binocular data sets. The monocular data sets used include general image data sets (treated as monocular images) such as Imagenet and Coco.
Further, the enhancing in step 2 includes applying random brightness variation, occlusion or random offset in the vertical direction.
Further, the point features extracted in step 3 use traditional feature detection operators such as ORB and SIFT. Considering that point features perform poorly in texture-less regions, line feature detection is added here: the regions obtained by line detection are included in the set of feature points, and the extracted feature points are used to train the feature point extraction network.
Further, the feature point extraction module in step 4 comprises a series of 2D convolution layers and pooling layers, and the labels used for training are the feature points extracted by traditional operators, including the points obtained by the point- and line-feature methods. Part of the feature point extraction module may also directly use a traditional operator, in which case the dimensions of the unary feature descriptors produced by the two approaches must be kept consistent.
Further, the unary feature extraction module in step 5 is based on a resnet structure. A unary feature is a feature vector describing a single pixel: stereo matching operates per pixel, and the goal is to find, for every pixel in the left image, the corresponding pixel in the right image.
Further, in the iteration module of step 6, L denotes a Lookup operation on the four-dimensional matching cost: a search radius r is defined and used as an index to retrieve an r-dimensional slice from the matching cost. The matching cost, the input disparity value and the hidden vector h are fed into the iteration module, which outputs a disparity increment Δd and an updated hidden vector h'; the initial disparity value is set to 0.
Further, the deformable convolution module in step 7 aggregates the unary features. Feature aggregation binds the disparity value of a single pixel, to some extent, to the disparity values of other pixels in the image, which prevents erroneous results in abnormal regions and improves the robustness of the system.
Further, the disparity regression in step 8 computes, over the disparity range, the weighted sum of each disparity value d_i with its corresponding weight. The disparity values d_i are the positive integers from d_min to d_max; the matching costs within this range are converted to probability values, and the weighted sum yields the corresponding disparity value.
Further, the disparity refinement in step 9 further refines the obtained disparity values using the input image. In particular, when speed is a priority, the network may output a low-resolution disparity map; the refinement module then up-samples it, guided by the original input image, to obtain a disparity map whose resolution matches the original input.
Examples
(1) Constructing a training image library. Acquiring binocular images with depth is difficult, so few binocular data sets are currently available. There are two main ways to obtain a depth map: first, using lidar; second, using an infrared depth sensor, which obtains a sparse depth map and cannot work effectively outdoors. The lack of data limits the universality of the algorithm; the common practice is therefore to pre-train on a synthetic data set and fine-tune on a small amount of real data. Owing to this data insufficiency, the generalization capability of deep-learning stereo matching algorithms is greatly limited, and completely wrong results are often produced on unseen scenes.
This part jointly uses data such as Sceneflow, Kitti and Middlebury, providing sufficient data for training the model. Further, considering that binocular data can hardly cover a wide variety of scenes, monocular data sets such as Imagenet and Coco are also collected to train the feature point extraction module, improving the generalization capability of the model.
(2) Image enhancement. For each training image, random brightness variations, occlusions and random offsets in the vertical direction are applied. In a binocular camera the two images may differ in brightness, occlusion makes the two views non-identical, and the rectification used for stereo matching may contain errors, so a random vertical offset is added for data enhancement.
The significance of this part lies in enhancing the data and overcoming some ill-posed aspects of the stereo matching problem. A data-enhancement sketch follows.
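A minimal sketch of such enhancement, assuming float32 image arrays; the parameter ranges and the helper name augment_pair are illustrative, not taken from the patent:

```python
import numpy as np

def augment_pair(left, right, rng=np.random.default_rng()):
    """Apply the step-2 enhancements to one binocular pair (float32 arrays)."""
    # Random brightness variation, applied per view to mimic exposure
    # differences between the two cameras.
    left = left * rng.uniform(0.8, 1.2)
    right = right * rng.uniform(0.8, 1.2)

    # Random occlusion: overwrite a rectangle in the right image with its
    # mean value, simulating content visible in only one view.
    h, w = right.shape[:2]
    y, x = int(rng.integers(0, h - 50)), int(rng.integers(0, w - 100))
    right[y:y + 50, x:x + 100] = right.mean()

    # Random vertical offset of the right image (imperfect rectification).
    dy = int(rng.integers(-2, 3))
    right = np.roll(right, dy, axis=0)
    return left, right
```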
(3) Point (line) feature extraction. Mainstream stereo matching methods currently generalize poorly, mainly because binocular data acquisition is demanding and can hardly cover all kinds of scenes. Feature detection operators are therefore adopted to provide feature point labels, and training on broad data improves the generalization capability of the model, giving it better universality on unseen scenes.
The significance of this part lies in providing labels for the feature point extraction network, so that it can be trained on broad data to yield a model with generalization capability; a label-generation sketch follows.
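A hedged sketch of how such labels might be produced, merging ORB point detections and line-segment detections into one supervision mask; the FastLineDetector assumes the opencv-contrib package, and feature_label_mask is an illustrative name:

```python
import cv2
import numpy as np

def feature_label_mask(gray):
    """Binary mask marking traditional point and line features (labels)."""
    mask = np.zeros(gray.shape[:2], dtype=np.uint8)

    # Point features from a traditional operator (ORB here; SIFT also works).
    for kp in cv2.ORB_create(nfeatures=5000).detect(gray, None):
        x, y = map(int, kp.pt)
        mask[y, x] = 1

    # Line features: rasterise detected segments into the same mask so that
    # texture-poor but structured regions also provide labels.
    fld = cv2.ximgproc.createFastLineDetector()
    lines = fld.detect(gray)
    if lines is not None:
        for x1, y1, x2, y2 in lines.reshape(-1, 4).astype(int):
            cv2.line(mask, (x1, y1), (x2, y2), 1)
    return mask
```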
(4) Feature point extraction and network training. The features extracted in this part are divided into two parts: one serves as the hidden vector h input to the iterative update module, and the other is fed directly into the iterative update module.
The significance of this part lies in training the feature point extraction network with data covering rich scenes, improving the universality and effectiveness of feature extraction and thereby the generalization capability of the model.
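A minimal PyTorch sketch of such a network (an assumption about the architecture, not the patented module): a stack of 2D convolutions and pooling whose output is split into the hidden vector h and a context feature fed directly to the iteration module:

```python
import torch
import torch.nn as nn

class FeaturePointNet(nn.Module):
    """2D conv + pooling encoder; output split into hidden state and context."""
    def __init__(self, hidden_dim=128, context_dim=128):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, hidden_dim + context_dim, 3, padding=1),
        )

    def forward(self, image):
        feat = self.encoder(image)
        h, ctx = feat[:, :self.hidden_dim], feat[:, self.hidden_dim:]
        # tanh bounds the recurrent hidden state; relu for the context input.
        return torch.tanh(h), torch.relu(ctx)
```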
(5) Unary feature extraction and construction of the four-dimensional matching cost. The unary feature extraction module is based on a resnet structure; a unary feature is a feature vector describing a single pixel, since stereo matching operates per pixel and the goal is to find, for every pixel in the left image, the corresponding pixel in the right image. After the unary features are extracted, the features of each pixel in the left image are multiplied with those of the pixels within the specified disparity range in the right image, constructing a four-dimensional matching cost of size c × d × h × w, where c denotes the number of channels, d the disparity range (d = d_max − d_min), h the image height and w the image width.
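A hedged sketch of this construction: left-image features are correlated with right-image features shifted across the disparity range, yielding the c × d × h × w volume described above (build_cost_volume is an illustrative name):

```python
import torch

def build_cost_volume(feat_l, feat_r, d_min=0, d_max=64):
    """feat_l, feat_r: (c, h, w) unary features; returns (c, d, h, w) cost."""
    c, h, w = feat_l.shape
    cost = feat_l.new_zeros(c, d_max - d_min, h, w)
    for i, disp in enumerate(range(d_min, d_max)):
        if disp == 0:
            cost[:, i] = feat_l * feat_r
        else:
            # Left pixel x matches right pixel x - disp.
            cost[:, i, :, disp:] = feat_l[:, :, disp:] * feat_r[:, :, :-disp]
    return cost
```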
(6) The iterative network module obtains a coarse disparity map. In mainstream methods, d_max denotes the maximum disparity between the left and right images, which makes disparity aggregation computationally heavy. The iterative module is used to obtain a disparity value accurate to within a few pixels, yielding a narrow range [d_min, d_max]. The iterative module thus significantly reduces the search range and increases the execution speed of the algorithm.
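A rough sketch of the Lookup operation L, assuming a RAFT-style indexing scheme (the patent text specifies only the radius r and the h/Δd update): a (2r+1)-dimensional cost slice is gathered around the current disparity estimate:

```python
import torch

def lookup(cost, disp, r=4):
    """cost: (d, h, w) matching cost; disp: (h, w) current disparity.
    Returns a (2r+1, h, w) slice of costs around the current estimate."""
    d, h, w = cost.shape
    idx = disp.round().long().clamp(0, d - 1)
    slices = []
    for o in range(-r, r + 1):
        j = (idx + o).clamp(0, d - 1)                  # neighbouring disparity
        slices.append(cost.gather(0, j.unsqueeze(0)))  # (1, h, w)
    return torch.cat(slices, dim=0)
```

Per the description, the update loop starts from disp = 0 and repeatedly applies disp = disp + Δd, with the hidden vector h carried between iterations.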
(7) The deformable convolution module aggregates the unary features. Mainstream methods use 3D convolution for feature aggregation; however, 3D convolution requires a large amount of graphics memory and long computation time. We introduce a deformable convolution instead, using two independent convolution layers to output a sampling offset Δp_k and a modulation weight m_k. The aggregated cost is given by formula (1):

\tilde{C}(d, p) = \sum_{k=1}^{K} \omega_k \cdot C(d, p + p_k + \Delta p_k) \cdot m_k    (1)

where C(d, p) is the matching cost at pixel p and disparity d, p_k is the k-th fixed sampling offset, K represents the number of sampling points, and ω_k represents the weight resulting from the softmax operation.
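A hedged sketch of formula (1) built on torchvision's deform_conv2d; here the offsets Δp_k and modulation weights m_k come from two plain convolutions, while the aggregation weights are learned convolution kernels (an assumption; the patent obtains ω_k from a softmax):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableAggregation(nn.Module):
    """Aggregate a (n, c, h, w) cost slice with modulated deformable conv."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * k * k, 3, padding=1)  # Δp_k
        self.mask = nn.Conv2d(channels, k * k, 3, padding=1)        # m_k
        self.weight = nn.Parameter(0.01 * torch.randn(channels, channels, k, k))

    def forward(self, cost):
        offset = self.offset(cost)               # per-location sampling shifts
        mask = torch.sigmoid(self.mask(cost))    # per-sample modulation in [0, 1]
        return deform_conv2d(cost, offset, self.weight, padding=1, mask=mask)
```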
(8) Disparity regression. The aggregated matching cost c_d is negated (for each disparity d) and converted into a probability by the softmax operation (denoted σ here); each disparity value is then weighted by its probability and summed, as shown in formula (2):

\hat{d} = \sum_{d = d_{min}}^{d_{max}} d \cdot \sigma(-c_d)    (2)

Here, we merge the point and line features into the loss function to obtain the disparity map quickly and robustly.
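A small sketch of formula (2), a standard soft-argmin regression consistent with the description above (disparity_regression is an illustrative name):

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost, d_min=0, d_max=64):
    """cost: (n, d, h, w) aggregated matching cost with d = d_max - d_min."""
    prob = F.softmax(-cost, dim=1)                        # sigma(-c_d)
    disp = torch.arange(d_min, d_max, dtype=cost.dtype,
                        device=cost.device).view(1, -1, 1, 1)
    return (prob * disp).sum(dim=1)                       # (n, h, w)
```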
(9) Disparity refinement. The disparity refinement module up-samples the obtained disparity map, guided by features of the input image, to produce a disparity map whose resolution matches that of the original image.
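A hedged sketch of such a refinement module: the low-resolution disparity is bilinearly up-sampled (values rescaled by the size ratio), concatenated with the original image, and corrected by a small residual network (the architecture is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisparityRefinement(nn.Module):
    """Up-sample a coarse disparity map to full resolution, guided by the image."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, disp_low, image):
        # disp_low: (n, 1, h/s, w/s); image: (n, 3, h, w) original input.
        scale = image.shape[-1] / disp_low.shape[-1]
        disp = F.interpolate(disp_low, size=image.shape[-2:],
                             mode="bilinear", align_corners=False) * scale
        residual = self.refine(torch.cat([disp, image], dim=1))
        return disp + residual
```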
Although the present application has been described above with reference to specific embodiments, those skilled in the art will recognize that many changes may be made in the configuration and details of the present application within the principles and scope of the present application. The scope of protection of the application is determined by the appended claims, and all changes that come within the meaning and range of equivalency of the technical features are intended to be embraced therein.

Claims (10)

1. An image stereo matching method, characterized in that it comprises the following steps:
step 1: constructing a training image library;
step 2: enhancing each training image;
step 3: training a point or line feature extraction network on single images;
step 4: extracting feature points or lines and training the network;
step 5: performing unary feature extraction on the images to construct the four-dimensional matching cost;
step 6: obtaining a coarse disparity map with the iterative network module;
step 7: aggregating the unary features with the deformable convolution module;
step 8: performing disparity regression;
step 9: refining the disparity.
2. The image stereo matching method according to claim 1, characterized in that: the image library in step 1 comprises binocular data and monocular data; each binocular sample comprises a left image and a right image; the binocular data are obtained from data sets including the Sceneflow synthetic data set, the Kitti autonomous driving data set, the Middlebury data set, the Eth3d data set and other binocular data sets; the monocular data are obtained from data sets including the Imagenet image data set and the Coco image data set.
3. The image stereo matching method according to claim 1, characterized in that: the enhancing in step 2 comprises applying random brightness variations, occlusions or random offsets in the vertical direction.
4. The image stereo matching method according to claim 1, characterized in that the point features in step 3 are extracted with a traditional feature detection operator, the line feature extraction includes the regions obtained by line detection in the set of feature points, and the extracted feature points are used for subsequent training of the feature point extraction network.
5. The image stereo matching method according to claim 1, characterized in that: the feature point extraction module in step 4 comprises a series of 2D convolution layers and pooling layers, and the labels used for training are the feature points extracted by traditional operators, including the points obtained by the point- and line-feature methods; part of the feature point extraction module may also directly use a traditional operator, in which case the dimensions of the unary feature descriptors produced by the two approaches must be kept consistent;
the features generated by the feature point extraction module in step 4 are divided into two parts: one part serves as the hidden vector h of the iteration module, and the other part is fed directly into the iteration module.
6. The image stereo matching method according to claim 1, characterized in that: the unary feature extraction module in the step 5 is based on a resnet structure.
7. The image stereo matching method according to claim 1, characterized in that: the deformable convolution module in step 7 aggregates the unary features, the feature aggregation binding the disparity value of a single pixel, to some extent, to the disparity values of other pixels in the image.
8. The image stereo matching method according to claim 1, characterized in that: the disparity regression in step 8 computes, over the disparity range, the weighted sum of each disparity value d_i with its corresponding weight; the disparity values d_i are the positive integers from d_min to d_max, the matching costs within this range are converted to probability values, and the weighted sum yields the corresponding disparity value.
9. The image stereo matching method according to claim 1, characterized in that: the disparity refinement in step 9 further refines the obtained disparity values using the input image; when speed is a priority, the network may output a low-resolution disparity map, and the refinement module up-samples it, guided by the original input image, to obtain a disparity map whose resolution matches the original input image.
10. The image stereo matching method according to any one of claims 1 to 9, characterized in that: network training jointly uses synthetic data and real data, the feature extraction network is trained with richer monocular data, and unary feature fusion is performed with deformable convolution.
CN202010847540.9A 2020-08-21 2020-08-21 Image stereo matching method Pending CN111951319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010847540.9A CN111951319A (en) 2020-08-21 2020-08-21 Image stereo matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010847540.9A CN111951319A (en) 2020-08-21 2020-08-21 Image stereo matching method

Publications (1)

Publication Number Publication Date
CN111951319A true CN111951319A (en) 2020-11-17

Family

ID=73358714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010847540.9A Pending CN111951319A (en) 2020-08-21 2020-08-21 Image stereo matching method

Country Status (1)

Country Link
CN (1) CN111951319A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598722A (en) * 2021-01-08 2021-04-02 北京深睿博联科技有限责任公司 Image stereo matching method and system based on deformable convolution network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764320A (en) * 2018-05-21 2018-11-06 深圳信息职业技术学院 Feature extracting method based on fractional order feature line analysis
CN108764248A (en) * 2018-04-18 2018-11-06 广州视源电子科技股份有限公司 The extracting method and device of image characteristic point
CN109005398A (en) * 2018-07-27 2018-12-14 杭州电子科技大学 A kind of stereo image parallax matching process based on convolutional neural networks
CN109409443A (en) * 2018-11-28 2019-03-01 北方工业大学 Multi-scale deformable convolution network target detection method based on deep learning
CN109934272A (en) * 2019-03-01 2019-06-25 大连理工大学 A kind of image matching method based on full convolutional network
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111508013A (en) * 2020-04-21 2020-08-07 中国科学技术大学 Stereo matching method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764248A (en) * 2018-04-18 2018-11-06 广州视源电子科技股份有限公司 The extracting method and device of image characteristic point
CN108764320A (en) * 2018-05-21 2018-11-06 深圳信息职业技术学院 Feature extracting method based on fractional order feature line analysis
CN109005398A (en) * 2018-07-27 2018-12-14 杭州电子科技大学 A kind of stereo image parallax matching process based on convolutional neural networks
CN109409443A (en) * 2018-11-28 2019-03-01 北方工业大学 Multi-scale deformable convolution network target detection method based on deep learning
CN109934272A (en) * 2019-03-01 2019-06-25 大连理工大学 A kind of image matching method based on full convolutional network
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111508013A (en) * 2020-04-21 2020-08-07 中国科学技术大学 Stereo matching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAOFEI XU et al.: "AANet: Adaptive Aggregation Network for Efficient Stereo Matching", arXiv

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598722A (en) * 2021-01-08 2021-04-02 北京深睿博联科技有限责任公司 Image stereo matching method and system based on deformable convolution network
CN112598722B (en) * 2021-01-08 2022-02-11 北京深睿博联科技有限责任公司 Image stereo matching method and system based on deformable convolution network

Similar Documents

Publication Publication Date Title
CN101388115B (en) Depth image autoegistration method combined with texture information
CN110188835B (en) Data-enhanced pedestrian re-identification method based on generative confrontation network model
CN111553845B (en) Quick image stitching method based on optimized three-dimensional reconstruction
CN113486887B (en) Target detection method and device in three-dimensional scene
CN110706269A (en) Binocular vision SLAM-based dynamic scene dense modeling method
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN113159043A (en) Feature point matching method and system based on semantic information
CN113705796A (en) Light field depth acquisition convolutional neural network based on EPI feature enhancement
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN114708475A (en) Point cloud multi-mode feature fusion network method for 3D scene understanding
CN111951319A (en) Image stereo matching method
Song et al. Voxelnextfusion: A simple, unified and effective voxel fusion framework for multi-modal 3d object detection
Tao et al. F-pvnet: Frustum-level 3-d object detection on point–voxel feature representation for autonomous driving
CN112270701A (en) Packet distance network-based parallax prediction method, system and storage medium
CN110390336B (en) Method for improving feature point matching precision
CN115294371B (en) Complementary feature reliable description and matching method based on deep learning
CN116612235A (en) Multi-view geometric unmanned aerial vehicle image three-dimensional reconstruction method and storage medium
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
Xiao et al. Instance-Aware Monocular 3D Semantic Scene Completion
CN106056599A (en) Object depth data-based object recognition algorithm and device
Zhang et al. 3D Object Detection Based on Multi-view Adaptive Fusion
Ai et al. MVTr: multi-feature voxel transformer for 3D object detection
CN115906007B (en) Intelligent driving characteristic parameter generation method, device and computer readable medium
CN113963335B (en) Road surface obstacle detection method based on image and point cloud data
Li et al. Enhancing depth quality of stereo vision using deep learning-based prior information of the driving environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination