CN116777971A - Binocular stereo matching method based on horizontal deformable attention module - Google Patents

Binocular stereo matching method based on horizontal deformable attention module

Info

Publication number
CN116777971A
CN116777971A (application CN202310614417.6A)
Authority
CN
China
Prior art keywords
horizontal
deformable
attention
stereo matching
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310614417.6A
Other languages
Chinese (zh)
Inventor
李保平
陈娜
杨飞
李晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN202310614417.6A priority Critical patent/CN116777971A/en
Publication of CN116777971A publication Critical patent/CN116777971A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]


Abstract

The invention relates to a binocular stereo matching method based on a horizontal deformable attention module, and belongs to the field of image processing. The invention aims to solve three problems of the non-local attention mechanism: its high computational complexity, the unnecessary or erroneous context information it may introduce into the stereo matching task, and its failure to exploit the horizontal (epipolar) constraint and the disparity-continuity constraint inherent in stereo matching. The invention improves the quality of the disparity map.

Description

Binocular stereo matching method based on horizontal deformable attention module
Technical Field
The invention belongs to the field of image processing, and particularly relates to a binocular stereo matching method based on a horizontal deformable attention module.
Background
Stereo matching is a fundamental and challenging task in computer vision, with wide application in autonomous driving, dense reconstruction, and other depth-related tasks. Stereo matching aims to calculate the disparity or depth corresponding to each pixel from two or more images taken from different viewpoints. Its difficulty lies in finding accurate correspondences between images in regions affected by adverse conditions such as missing texture, occlusion, and illumination change. To address this, many deep-learning-based stereo matching methods have appeared in recent years. They generally adopt an end-to-end network structure comprising feature extraction, cost computation, cost aggregation, and disparity regression modules: the feature extraction module extracts high-level semantic features from the input images; the cost computation module constructs a cost volume from the similarity or difference between features; the cost aggregation module regularizes and optimizes the cost volume; and the disparity regression module generates the final disparity map from the optimized cost volume. Among these, cost aggregation is one of the key factors affecting stereo matching performance, since it must fully exploit context information to suppress ambiguity and noise. A common way to capture context information is the non-local attention mechanism, which computes the correlation between any two locations through self-attention and updates the features of each location according to the correlation weights.
However, the non-local attention mechanism suffers from two problems. First, its computational complexity is high, because it operates globally over the whole feature map. Second, for the stereo matching task it may introduce unnecessary or erroneous context information, and it does not take into account the horizontal (epipolar) constraint and the disparity-continuity constraint inherent in stereo matching.
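The complexity gap described above can be made concrete with a back-of-the-envelope count; the feature-map size below is an illustrative assumption, not a figure from the patent:

```python
# Back-of-the-envelope count of pairwise correlations per attention scheme.
# H and W are an assumed feature-map size, chosen only for illustration.
H, W = 64, 128
n = H * W                    # number of spatial positions in the feature map
non_local_pairs = n * n      # non-local attention: every position attends to every position
horizontal_pairs = n * W     # horizontal attention: every position attends to its own row only
# Restricting attention to image rows is cheaper by exactly a factor of H.
assert non_local_pairs == horizontal_pairs * H
```

For rectified stereo pairs this restriction costs nothing in principle, since the matching pixel is guaranteed to lie on the same image row.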
Disclosure of Invention
First, the technical problem to be solved
The technical problem to be solved by the invention is how to provide a binocular stereo matching method based on a horizontal deformable attention module, so as to address the high computational complexity of the non-local attention mechanism, the unnecessary or erroneous context information it may introduce into the stereo matching task, and its failure to exploit the horizontal constraint and disparity-continuity constraint inherent in stereo matching.
(II) technical scheme
In order to solve the above technical problems, the invention provides a binocular stereo matching method based on a horizontal deformable attention module, which comprises the following steps:
S1, inputting the left view into a first ResNet backbone network to extract left-view features, and inputting the right view into a second ResNet backbone network to extract right-view features;
S2, inputting the features output by the first ResNet backbone network into a first horizontal deformable attention module for further feature processing, and inputting the features output by the second ResNet backbone network into a second horizontal deformable attention module for further feature processing;
S3, concatenating the features processed by the first horizontal deformable attention module and the second horizontal deformable attention module into a matching cost volume, and then obtaining the final disparity map through 3D convolution and disparity regression;
wherein the first horizontal deformable attention module and the second horizontal deformable attention module are the same horizontal deformable attention module, each comprising a horizontal attention mechanism and a deformable convolution module.
(III) beneficial effects
Compared with the prior art, the invention provides a binocular stereo matching method based on a horizontal deformable attention module. The method constructs a horizontal deformable attention module in which the horizontal attention mechanism introduces the epipolar constraint, so that the similarity of pixel features between the left and right views in the horizontal direction is learned better and the matching of the same pixel across the two views is improved; meanwhile, the deformable convolution makes better use of context feature information to refine the textures of the disparity map, thereby improving its quality.
Drawings
FIG. 1 is a diagram of a binocular stereo matching method architecture based on a horizontal deformable attention module of the present invention (the inside of the dashed line frame is a detailed block diagram of the horizontal deformable attention module of the present invention);
FIG. 2 is a detailed block diagram of a horizontally deformable attention module of the present invention;
FIG. 3 is a graph comparing the effects of horizontal attention mechanisms.
Detailed Description
To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.
In order to better fit the characteristics of the stereo matching task, the invention provides a stereo matching method based on a horizontal deformable local attention module, which effectively captures global correspondence cues in the horizontal direction and adjusts the distribution of context information in disparity-discontinuous regions through deformable convolution, thereby improving the accuracy and robustness of stereo matching.
Fig. 1 is the neural network architecture diagram of the binocular stereo matching method based on the horizontal deformable attention module. The horizontal deformable attention module further processes the left-view and right-view features extracted by the ResNet backbone networks; the processed features are concatenated into a matching cost volume, and the final disparity map is then obtained through 3D convolution and disparity regression. The binocular stereo matching method based on the horizontal deformable attention module comprises the following steps:
S1, inputting the left view into a first ResNet backbone network to extract left-view features, and inputting the right view into a second ResNet backbone network to extract right-view features;
S2, inputting the features output by the first ResNet backbone network into a first horizontal deformable attention module for further feature processing, and inputting the features output by the second ResNet backbone network into a second horizontal deformable attention module for further feature processing;
S3, concatenating the features processed by the first horizontal deformable attention module and the second horizontal deformable attention module into a matching cost volume, and then obtaining the final disparity map through 3D convolution and disparity regression.
The first horizontal deformable attention module and the second horizontal deformable attention module are the same horizontal deformable attention module, each comprising a horizontal attention mechanism and a deformable convolution module.
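Step S3 — feature concatenation into a matching cost volume followed by disparity regression — can be sketched in NumPy as follows. The concatenation-style volume and soft-argmax regression follow common practice in end-to-end stereo networks; the 3D-convolution aggregation is replaced here by a trivial channel sum, and all helper names and tensor sizes are illustrative assumptions rather than the patent's configuration:

```python
import numpy as np

def build_concat_cost_volume(f_left, f_right, max_disp):
    """Concatenation-style cost volume: for each candidate disparity d,
    pair each left-feature pixel at column x with the right-feature
    pixel at column x - d (zero where the shift leaves the image)."""
    C, H, W = f_left.shape
    vol = np.zeros((2 * C, max_disp, H, W), dtype=f_left.dtype)
    for d in range(max_disp):
        vol[:C, d, :, d:] = f_left[:, :, d:]
        vol[C:, d, :, d:] = f_right[:, :, : W - d]
    return vol

def soft_argmax_disparity(score):
    """Disparity regression: softmax over the disparity axis, then the
    expectation of the disparity index (score = matching likelihood)."""
    D = score.shape[0]
    e = np.exp(score - score.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)
    disp = np.arange(D, dtype=score.dtype).reshape(D, 1, 1)
    return (p * disp).sum(axis=0)

C, H, W, D = 4, 5, 8, 3
rng = np.random.default_rng(1)
fl, fr = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
vol = build_concat_cost_volume(fl, fr, D)   # (2C, D, H, W)
score = vol.sum(axis=0)                     # stand-in for the 3D-conv aggregation
disp = soft_argmax_disparity(score)         # (H, W) sub-pixel disparity map
```

The soft-argmax keeps the regression differentiable, which is what allows the whole pipeline to be trained end-to-end.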
Wherein the first ResNet backbone network and the second ResNet backbone network are the same ResNet backbone network.
The neural network of fig. 1 shares parameters between the first ResNet backbone network and the second ResNet backbone network that process the left and right views, and between the first horizontal deformable attention module and the second horizontal deformable attention module that process the left-view and right-view features. Parameter sharing means that the parameters of the two shared modules remain identical throughout training and inference; it is a common strategy in neural network training.
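The effect of parameter sharing can be illustrated with a minimal sketch in which a single weight matrix stands in for the shared ResNet backbone; the linear map and all shapes are illustrative assumptions, not the patent's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((8, 3))   # ONE set of weights serves both views

def backbone(view, weights):
    # stand-in for the ResNet backbone: a single per-pixel linear map
    return view @ weights.T

left = rng.standard_normal((4, 3))       # 4 "pixels" with 3 input channels each
right = rng.standard_normal((4, 3))
f_left = backbone(left, W_shared)        # both calls use the SAME weights,
f_right = backbone(right, W_shared)      # so identical inputs give identical features
assert np.allclose(backbone(left, W_shared), f_left)
```

The practical consequence is that corresponding pixels in the two views are embedded into the same feature space, which is what makes the later cost computation meaningful.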
The horizontal deformable attention module is the contribution proposed by the invention; fig. 2 is its detailed structural diagram.
The horizontal attention mechanism attends only to features of pixels lying on the same row as the center pixel, exploiting the epipolar constraint to capture feature pairs that are valid for matching. The process of adaptively aggregating horizontal spatial depth-context features is described below.
The left part of FIG. 2 is the horizontal attention mechanism. The input unary feature is denoted X ∈ R^(C×H×W), where C, H and W are the number of channels, the spatial height and the spatial width, respectively. First, X is processed by three convolution layers f_query, f_key and f_value, each with a 1×1 kernel, to obtain Q ∈ R^(C′×H×W), K ∈ R^(C′×H×W) and V ∈ R^(C×H×W), where C′ = C/2. The kernel parameters of f_query and f_key are shared, and the number of output channels is reduced from 320 to 128, which lowers computation and memory consumption. The three outputs are then reshaped: Q is flattened and transposed to obtain Q̄ ∈ R^(N×C′), where N = H×W, and K and V are reshaped to obtain K̄ ∈ R^(H×C′×W) and V̄ ∈ R^(H×C×W). The attention map Y ∈ R^(N×W) is obtained by computing, within each image row, the correlation matrix between Q̄ and K̄ and applying softmax:
y_(J,I) = exp(Q̄_I · K̄_J) / Σ_(J′=1..W) exp(Q̄_I · K̄_(J′))
where y_(J,I) is the normalized correlation between the query feature Q̄ at pixel I and the key feature K̄ at position J on the same row (Y thus has n = N×W entries). In stereo matching, the more similar two features are, the more reliable the disparity between the two spatial points, so aggregating similar depth features achieves mutual gain. A correlation matrix operation is then performed between the attention map and the value feature V̄, and the result is reshaped back to Y′ ∈ R^(C×H×W). Finally, the context feature Y′ is summed with the input unary feature X to obtain the final output A ∈ R^(C×H×W) of the feature processing stage, expressed as
A=sum(αY',X)
where α is a learnable parameter that adjusts the weight of the attention branch. The attention mechanism tends to capture global context information, and adding the attention features directly to the input features is especially effective for keeping the depth consistent in texture-less regions and within individual objects. Although the horizontal attention mechanism attends only to features in the horizontal direction, the feature X extracted by the ResNet backbone network already carries a reliable global receptive field and thus provides sufficient global context information.
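The row-wise attention described above can be sketched in NumPy as follows; the 1×1 convolutions become per-pixel linear maps, f_query and f_key share weights as in the text, and all tensor sizes are small illustrative assumptions rather than the 320/128-channel configuration of the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def horizontal_attention(X, Wq, Wk, Wv, alpha=1.0):
    """Row-wise self-attention: every pixel attends only to pixels on its
    own image row (the epipolar line for a rectified stereo pair)."""
    C, H, W = X.shape
    x = X.transpose(1, 2, 0)                    # (H, W, C)
    Q = x @ Wq.T                                # (H, W, C'): 1x1 conv == per-pixel linear map
    K = x @ Wk.T                                # (H, W, C')
    V = x @ Wv.T                                # (H, W, C)
    scores = np.einsum('hic,hjc->hij', Q, K)    # correlations within each row
    Y = softmax(scores, axis=-1)                # attention map, one W×W block per row
    Yp = np.einsum('hij,hjc->hic', Y, V)        # aggregate value features along the row
    A = alpha * Yp + x                          # residual sum with the input feature
    return A.transpose(2, 0, 1)                 # back to (C, H, W)

C, H, W = 8, 4, 6
rng = np.random.default_rng(0)
X = rng.standard_normal((C, H, W))
Cp = C // 2
Wq = rng.standard_normal((Cp, C))
Wk = Wq.copy()                                  # f_query and f_key share parameters
Wv = rng.standard_normal((C, C))
A = horizontal_attention(X, Wq, Wk, Wv, alpha=0.1)  # alpha is learnable in practice
```

In a trained network Wq, Wk, Wv and alpha are learned; here they are random only to exercise the shapes.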
Although the attention mechanism improves the object boundaries in the disparity map to some extent, the disparity is still not accurate enough: as shown in fig. 3, the boundaries produced with the horizontal attention mechanism (third row of fig. 3) are clearer than those produced without it (second row of fig. 3). A deformable convolution is therefore applied to the horizontal-attention features to further improve the object boundaries in the disparity map. The right part of FIG. 2 shows the deformable convolution module that processes the feature A output by the horizontal attention mechanism; by learning spatial convolution offsets, the module adaptively samples feature information at positions whose depth is similar to that of the center pixel. The deformable convolution module in the method consists of a 1×1 two-dimensional convolution, a 3×3 deformable convolution and another 1×1 two-dimensional convolution, which can be expressed as
A'=conv2d(deformconv2d(conv2d(A)))
where conv2d denotes a 1×1 two-dimensional convolution and deformconv2d denotes a 3×3 deformable convolution; A′ is the feature after processing by the horizontal deformable attention module. The fourth row of fig. 3 shows the effect of the horizontal deformable attention module: the accuracy of object-boundary disparity improves further compared with using the horizontal attention mechanism alone.
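The key idea of the deformable convolution — sampling each kernel tap at a learned fractional offset via bilinear interpolation — can be illustrated with a minimal single-channel NumPy sketch. Real deformable convolution (e.g. torchvision's DeformConv2d) predicts a separate offset per output location; here one offset per tap is used for brevity, so this is a simplification, not the patent's module:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinear sample of img (H, W) at fractional coords, zero outside."""
    H, W = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    out = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < H and 0 <= xx < W:
                out += wy * wx * img[yy, xx]
    return out

def deform_conv_3x3(x, weight, offsets):
    """Single-channel 3x3 deformable convolution (stride 1, zero padding).
    offsets[ky, kx] = (dy, dx) is added to each kernel tap's sampling
    position, letting the kernel gather context where it is most useful."""
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    dy, dx = offsets[ky, kx]
                    acc += weight[ky, kx] * bilinear(x, i + ky - 1 + dy, j + kx - 1 + dx)
            out[i, j] = acc
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.zeros((3, 3)); w[1, 1] = 1.0                  # identity kernel
zero = np.zeros((3, 3, 2))
plain = deform_conv_3x3(x, w, zero)                  # zero offsets -> ordinary conv
shift = zero.copy(); shift[1, 1] = (0.0, 1.0)        # center tap samples one pixel right
out = deform_conv_3x3(x, w, shift)
```

With zero offsets the module degenerates to a plain convolution; the learned offsets are what let it adapt its sampling pattern near disparity discontinuities.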
Both the horizontal deformable attention module itself and any horizontal deformable attention module formed by combining a horizontal attention mechanism with a deformable convolution module fall within the protection scope of the invention.
Example 1:
An ablation experiment demonstrates that both the horizontal attention mechanism and the deformable convolution improve the quality of the generated disparity map. To verify the effectiveness of the method, a comparison test was carried out in which the horizontal attention mechanism and the deformable convolution were added to the feature extraction part of a deep-learning binocular stereo matching network, the rest of the network structure being unchanged. Four configurations were tested: neither horizontal attention nor deformable convolution; deformable convolution only; horizontal attention only; and horizontal attention with deformable convolution. The results are shown in table 1. The GPU used was a GeForce RTX 2080 Ti, the input image resolution was 256×512, and the test was performed on the SceneFlow dataset.
Table 1 experimental comparative results
Compared with the prior art, the proposed method constructs a horizontal deformable attention module in which the horizontal attention mechanism introduces the epipolar constraint, so that the similarity of pixel features between the left and right views in the horizontal direction is learned better and the matching of the same pixel across the two views is improved; meanwhile, the deformable convolution makes better use of context feature information to refine the textures of the disparity map, thereby improving its quality.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A binocular stereo matching method based on a horizontal deformable attention module, the method comprising:
s1, inputting the characteristics of a left view extracted by a first ResNet backbone network into a left view of a binocular stereo matching method, and inputting the characteristics of a right view extracted by a second ResNet backbone network into a right view of the binocular stereo matching method;
s2, inputting the characteristics output by the first ResNet backbone network into a first horizontal deformable attention module for further characteristic processing, and inputting the characteristics output by the second ResNet backbone network into a second horizontal deformable attention module for further characteristic processing;
s3, the characteristics processed by the first horizontal deformable attention module and the second horizontal deformable attention module form a matching cost body through characteristic cascade, and then a final parallax image is obtained through three-dimensional convolution parallax regression;
the first horizontal deformable attention module and the second horizontal deformable attention module are the same horizontal deformable attention module, and both the first horizontal deformable attention module and the second horizontal deformable attention module comprise a horizontal attention mechanism and a deformation convolution module.
2. The binocular stereo matching method based on the horizontal deformable attention module of claim 1, wherein the first and second ResNet backbone networks are the same ResNet backbone network.
3. The binocular stereo matching method based on the horizontal deformable attention module of claim 1, wherein parameter sharing is performed between the first ResNet backbone network and the second ResNet backbone network.
4. The binocular stereo matching method based on horizontal deformable attention modules of claim 1, wherein parameter sharing is performed between the first horizontal deformable attention module and the second horizontal deformable attention module.
5. A binocular stereo matching method based on a horizontal deformable attention module according to any of claims 1-4, wherein the horizontal attention mechanism comprises in particular:
representing the input unary feature as X ∈ R^(C×H×W), wherein C, H and W are the number of channels, the spatial height and the spatial width, respectively;
first, convolving X with three convolution layers f_query, f_key and f_value, each with a 1×1 kernel, to obtain Q ∈ R^(C′×H×W), K ∈ R^(C′×H×W) and V ∈ R^(C×H×W), wherein C′ = C/2;
then reshaping the three outputs: Q is flattened and transposed to obtain Q̄ ∈ R^(N×C′), wherein N = H×W, and K and V are reshaped to obtain K̄ ∈ R^(H×C′×W) and V̄ ∈ R^(H×C×W);
computing, within each image row, the correlation matrix between Q̄ and K̄ and applying softmax to obtain the attention map Y ∈ R^(N×W), expressed as:
y_(J,I) = exp(Q̄_I · K̄_J) / Σ_(J′) exp(Q̄_I · K̄_(J′))
wherein y_(J,I) represents the normalized correlation between Q̄ at pixel I and K̄ at position J on the same row; the more similar two features are in stereo matching, the more reliable the disparity between the two spatial points, so aggregating similar depth features achieves mutual gain;
performing a correlation matrix operation between the attention map and the value feature V̄, reshaping the result to Y′ ∈ R^(C×H×W), and finally summing the context feature Y′ with the input unary feature X to obtain the final output A ∈ R^(C×H×W) of the feature processing stage, expressed as:
A=sum(αY',X)
where α is a learnable parameter used to adjust the operational weight of the attention mechanism.
6. The binocular stereo matching method of claim 5, wherein the convolution kernel parameters of f_query and f_key are shared and the number of channels of the output features is reduced from 320 to 128.
7. The binocular stereo matching method of claim 5, wherein the input unary feature X is a feature extracted through the ResNet backbone network, which has a reliable global receptive field and provides sufficient global context information.
8. The binocular stereo matching method of claim 5, wherein the horizontal attention mechanism attends only to local features of pixels on the same row as the center pixel, exploiting the epipolar constraint to capture more feature pairs that are valid for matching.
9. The binocular stereo matching method based on the horizontal deformable attention module of claim 5, wherein the deformable convolution module comprises a 1×1 two-dimensional convolution, a 3×3 deformable convolution and a 1×1 two-dimensional convolution, expressed as,
A'=conv2d(deformconv2d(conv2d(A)))
where conv2d represents a 1 x 1 two-dimensional convolution and deformconv2d represents a 3 x 3 deformable convolution, a' being the feature after processing by the horizontal deformable attention module.
10. The binocular stereo matching method based on the horizontal deformable attention module of claim 9, wherein the feature A output by the horizontal attention mechanism is processed by the deformable convolution module, which adaptively samples feature information at positions with depth similar to the center pixel by learning spatial convolution offsets.
CN202310614417.6A 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module Pending CN116777971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614417.6A CN116777971A (en) 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310614417.6A CN116777971A (en) 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module

Publications (1)

Publication Number Publication Date
CN116777971A true CN116777971A (en) 2023-09-19

Family

ID=87987045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614417.6A Pending CN116777971A (en) 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module

Country Status (1)

Country Link
CN (1) CN116777971A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058252A (en) * 2023-10-12 2023-11-14 东莞市爱培科技术有限公司 Self-adaptive fusion stereo matching algorithm
CN117058252B (en) * 2023-10-12 2023-12-26 东莞市爱培科技术有限公司 Self-adaptive fusion stereo matching method

Similar Documents

Publication Publication Date Title
CN110084757B (en) Infrared depth image enhancement method based on generation countermeasure network
CN111259945B (en) Binocular parallax estimation method introducing attention map
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
Xia et al. Identifying recurring patterns with deep neural networks for natural image denoising
US11170202B2 (en) Apparatus and method for performing 3D estimation based on locally determined 3D information hypotheses
Yue et al. CID: Combined image denoising in spatial and frequency domains using Web images
CN111275643A (en) True noise blind denoising network model and method based on channel and space attention
CN103996201A (en) Stereo matching method based on improved gradient and adaptive window
Jiang et al. Learning a referenceless stereopair quality engine with deep nonnegativity constrained sparse autoencoder
CN116777971A (en) Binocular stereo matching method based on horizontal deformable attention module
CN115239870A (en) Multi-view stereo network three-dimensional reconstruction method based on attention cost body pyramid
CN116912405A (en) Three-dimensional reconstruction method and system based on improved MVSNet
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
CN113538569A (en) Weak texture object pose estimation method and system
Liu et al. Temporal consistency learning of inter-frames for video super-resolution
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN111553296A (en) Two-value neural network stereo vision matching method based on FPGA
CN111582437A (en) Construction method of parallax regression deep neural network
CN214587004U (en) Stereo matching acceleration circuit, image processor and three-dimensional imaging electronic equipment
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
Li et al. Graph-based saliency fusion with superpixel-level belief propagation for 3D fixation prediction
Liu et al. Dnt: Learning unsupervised denoising transformer from single noisy image
CN112634128B (en) Stereo image redirection method based on deep learning
Rezayi et al. Huber Markov random field for joint super resolution
Bae et al. Efficient and scalable view generation from a single image using fully convolutional networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination