CN116777971A - Binocular stereo matching method based on horizontal deformable attention module - Google Patents

Binocular stereo matching method based on horizontal deformable attention module

Info

Publication number
CN116777971A
CN116777971A (application CN202310614417.6A)
Authority
CN
China
Prior art keywords
horizontal
deformable
attention
stereo matching
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310614417.6A
Other languages
Chinese (zh)
Inventor
李保平
陈娜
杨飞
李晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN202310614417.6A priority Critical patent/CN116777971A/en
Publication of CN116777971A publication Critical patent/CN116777971A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]


Abstract

The invention relates to a binocular stereo matching method based on a horizontal deformable attention module, and belongs to the field of image processing. The invention aims to solve three problems of the non-local attention mechanism: its high computational complexity, the unnecessary or erroneous context information it may introduce into the stereo matching task, and its failure to exploit the horizontal (epipolar) constraint and the disparity-continuity constraint inherent in stereo matching. The invention improves the quality of the disparity map.

Description

Binocular stereo matching method based on horizontal deformable attention module
Technical Field
The invention belongs to the field of image processing, and particularly relates to a binocular stereo matching method based on a horizontal deformable attention module.
Background
Stereo matching is a fundamental and challenging task in computer vision, with wide application in autonomous driving, dense reconstruction, and other depth-related tasks. Stereo matching aims to calculate the disparity or depth corresponding to each pixel from two or more images taken from different viewpoints. Its difficulty lies in finding accurate correspondences between images in regions affected by adverse conditions such as missing texture, occlusion, and illumination change. To address this, many deep-learning-based stereo matching methods have appeared in recent years. They generally adopt an end-to-end network structure comprising feature extraction, cost computation, cost aggregation, and disparity regression modules: the feature extraction module extracts high-level semantic features from the input images; the cost computation module constructs a cost volume from the similarity or difference between features; the cost aggregation module regularizes and optimizes the cost volume; and the disparity regression module generates the final disparity map from the optimized cost volume. Among these, cost aggregation is one of the key factors affecting stereo matching performance, since it must fully exploit context information to suppress ambiguity and noise. A common way to capture context information is the non-local attention mechanism, which computes the correlation between any two locations through self-attention and updates the features of each location according to the correlation weights.
However, the non-local attention mechanism suffers from two problems. First, its computational complexity is high, because it operates globally over the whole feature map. Second, for the stereo matching task it may introduce unnecessary or erroneous context information, and it does not take into account the horizontal (epipolar) constraint and the disparity-continuity constraint inherent in stereo matching.
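The complexity gap described above can be made concrete with a back-of-the-envelope count; the feature-map size below is an illustrative assumption, not a figure from the patent:

```python
# Back-of-the-envelope count of pairwise correlations per attention scheme.
# H and W are an assumed feature-map size, chosen only for illustration.
H, W = 64, 128
n = H * W                    # number of spatial positions in the feature map
non_local_pairs = n * n      # non-local attention: every position attends to every position
horizontal_pairs = n * W     # horizontal attention: every position attends to its own row only
# Restricting attention to image rows is cheaper by exactly a factor of H.
assert non_local_pairs == horizontal_pairs * H
```

For rectified stereo pairs this restriction costs nothing in principle, since the matching pixel is guaranteed to lie on the same image row.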
Disclosure of Invention
First, the technical problem to be solved
The technical problem to be solved by the invention is how to provide a binocular stereo matching method based on a horizontal deformable attention module, so as to address the high computational complexity of the non-local attention mechanism, the unnecessary or erroneous context information it may introduce into the stereo matching task, and its failure to exploit the horizontal constraint and disparity-continuity constraint inherent in stereo matching.
(II) technical scheme
In order to solve the above technical problems, the invention provides a binocular stereo matching method based on a horizontal deformable attention module, which comprises the following steps:
S1, inputting the left view into a first ResNet backbone network to extract left-view features, and inputting the right view into a second ResNet backbone network to extract right-view features;
S2, inputting the features output by the first ResNet backbone network into a first horizontal deformable attention module for further feature processing, and inputting the features output by the second ResNet backbone network into a second horizontal deformable attention module for further feature processing;
S3, concatenating the features processed by the first horizontal deformable attention module and the second horizontal deformable attention module into a matching cost volume, and then obtaining the final disparity map through 3D convolution and disparity regression;
wherein the first horizontal deformable attention module and the second horizontal deformable attention module are the same horizontal deformable attention module, each comprising a horizontal attention mechanism and a deformable convolution module.
(III) beneficial effects
Compared with the prior art, the invention provides a binocular stereo matching method based on a horizontal deformable attention module. The method constructs a horizontal deformable attention module in which the horizontal attention mechanism introduces the epipolar constraint, so that the similarity of pixel features between the left and right views in the horizontal direction is learned better and the matching of the same pixel across the two views is improved; meanwhile, the deformable convolution makes better use of context feature information to refine the textures of the disparity map, thereby improving its quality.
Drawings
FIG. 1 is a diagram of a binocular stereo matching method architecture based on a horizontal deformable attention module of the present invention (the inside of the dashed line frame is a detailed block diagram of the horizontal deformable attention module of the present invention);
FIG. 2 is a detailed block diagram of a horizontally deformable attention module of the present invention;
FIG. 3 is a graph comparing the effects of horizontal attention mechanisms.
Detailed Description
To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.
In order to better fit the characteristics of the stereo matching task, the invention provides a stereo matching method based on a horizontal deformable local attention module, which effectively captures global correspondence cues in the horizontal direction and adjusts the distribution of context information in disparity-discontinuous regions through deformable convolution, thereby improving the accuracy and robustness of stereo matching.
Fig. 1 is the neural network architecture diagram of the binocular stereo matching method based on the horizontal deformable attention module. The horizontal deformable attention module further processes the left-view and right-view features extracted by the ResNet backbone networks; the processed features are concatenated into a matching cost volume, and the final disparity map is then obtained through 3D convolution and disparity regression. The binocular stereo matching method based on the horizontal deformable attention module comprises the following steps:
S1, inputting the left view into a first ResNet backbone network to extract left-view features, and inputting the right view into a second ResNet backbone network to extract right-view features;
S2, inputting the features output by the first ResNet backbone network into a first horizontal deformable attention module for further feature processing, and inputting the features output by the second ResNet backbone network into a second horizontal deformable attention module for further feature processing;
S3, concatenating the features processed by the first horizontal deformable attention module and the second horizontal deformable attention module into a matching cost volume, and then obtaining the final disparity map through 3D convolution and disparity regression.
The first horizontal deformable attention module and the second horizontal deformable attention module are the same horizontal deformable attention module, each comprising a horizontal attention mechanism and a deformable convolution module.
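Step S3 — feature concatenation into a matching cost volume followed by disparity regression — can be sketched in NumPy as follows. The concatenation-style volume and soft-argmax regression follow common practice in end-to-end stereo networks; the 3D-convolution aggregation is replaced here by a trivial channel sum, and all helper names and tensor sizes are illustrative assumptions rather than the patent's configuration:

```python
import numpy as np

def build_concat_cost_volume(f_left, f_right, max_disp):
    """Concatenation-style cost volume: for each candidate disparity d,
    pair each left-feature pixel at column x with the right-feature
    pixel at column x - d (zero where the shift leaves the image)."""
    C, H, W = f_left.shape
    vol = np.zeros((2 * C, max_disp, H, W), dtype=f_left.dtype)
    for d in range(max_disp):
        vol[:C, d, :, d:] = f_left[:, :, d:]
        vol[C:, d, :, d:] = f_right[:, :, : W - d]
    return vol

def soft_argmax_disparity(score):
    """Disparity regression: softmax over the disparity axis, then the
    expectation of the disparity index (score = matching likelihood)."""
    D = score.shape[0]
    e = np.exp(score - score.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)
    disp = np.arange(D, dtype=score.dtype).reshape(D, 1, 1)
    return (p * disp).sum(axis=0)

C, H, W, D = 4, 5, 8, 3
rng = np.random.default_rng(1)
fl, fr = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
vol = build_concat_cost_volume(fl, fr, D)   # (2C, D, H, W)
score = vol.sum(axis=0)                     # stand-in for the 3D-conv aggregation
disp = soft_argmax_disparity(score)         # (H, W) sub-pixel disparity map
```

The soft-argmax keeps the regression differentiable, which is what allows the whole pipeline to be trained end-to-end.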
Wherein the first ResNet backbone network and the second ResNet backbone network are the same ResNet backbone network.
The neural network of fig. 1 shares parameters between the first ResNet backbone network and the second ResNet backbone network that process the left and right views, and between the first horizontal deformable attention module and the second horizontal deformable attention module that process the left-view and right-view features. Parameter sharing means that the parameters of the two shared modules remain identical throughout training and inference; it is a common strategy in neural network training.
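The effect of parameter sharing can be illustrated with a minimal sketch in which a single weight matrix stands in for the shared ResNet backbone; the linear map and all shapes are illustrative assumptions, not the patent's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((8, 3))   # ONE set of weights serves both views

def backbone(view, weights):
    # stand-in for the ResNet backbone: a single per-pixel linear map
    return view @ weights.T

left = rng.standard_normal((4, 3))       # 4 "pixels" with 3 input channels each
right = rng.standard_normal((4, 3))
f_left = backbone(left, W_shared)        # both calls use the SAME weights,
f_right = backbone(right, W_shared)      # so identical inputs give identical features
assert np.allclose(backbone(left, W_shared), f_left)
```

The practical consequence is that corresponding pixels in the two views are embedded into the same feature space, which is what makes the later cost computation meaningful.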
The horizontal deformable attention module is the contribution proposed by the invention; fig. 2 is its detailed structural diagram.
The horizontal attention mechanism attends only to features of pixels lying on the same row as the center pixel, exploiting the epipolar constraint to capture feature pairs that are valid for matching. The process of adaptively aggregating horizontal spatial depth-context features is described below.
The left part of FIG. 2 is the horizontal attention mechanism. The input unary feature is denoted X ∈ R^(C×H×W), where C, H and W are the number of channels, the spatial height and the spatial width, respectively. First, X is processed by three convolution layers f_query, f_key and f_value, each with a 1×1 kernel, to obtain Q ∈ R^(C′×H×W), K ∈ R^(C′×H×W) and V ∈ R^(C×H×W), where C′ = C/2. The kernel parameters of f_query and f_key are shared, and the number of output channels is reduced from 320 to 128, which lowers computation and memory consumption. The three outputs are then reshaped: Q is flattened and transposed to obtain Q̄ ∈ R^(N×C′), where N = H×W, and K and V are reshaped to obtain K̄ ∈ R^(H×C′×W) and V̄ ∈ R^(H×C×W). The attention map Y ∈ R^(N×W) is obtained by computing, within each image row, the correlation matrix between Q̄ and K̄ and applying softmax:
y_(J,I) = exp(Q̄_I · K̄_J) / Σ_(J′=1..W) exp(Q̄_I · K̄_(J′))
where y_(J,I) is the normalized correlation between the query feature Q̄ at pixel I and the key feature K̄ at position J on the same row (Y thus has n = N×W entries). In stereo matching, the more similar two features are, the more reliable the disparity between the two spatial points, so aggregating similar depth features achieves mutual gain. A correlation matrix operation is then performed between the attention map and the value feature V̄, and the result is reshaped back to Y′ ∈ R^(C×H×W). Finally, the context feature Y′ is summed with the input unary feature X to obtain the final output A ∈ R^(C×H×W) of the feature processing stage, expressed as
A=sum(αY',X)
where α is a learnable parameter that adjusts the weight of the attention branch. The attention mechanism tends to capture global context information, and adding the attention features directly to the input features is especially effective for keeping the depth consistent in texture-less regions and within individual objects. Although the horizontal attention mechanism attends only to features in the horizontal direction, the feature X extracted by the ResNet backbone network already carries a reliable global receptive field and thus provides sufficient global context information.
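The row-wise attention described above can be sketched in NumPy as follows; the 1×1 convolutions become per-pixel linear maps, f_query and f_key share weights as in the text, and all tensor sizes are small illustrative assumptions rather than the 320/128-channel configuration of the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def horizontal_attention(X, Wq, Wk, Wv, alpha=1.0):
    """Row-wise self-attention: every pixel attends only to pixels on its
    own image row (the epipolar line for a rectified stereo pair)."""
    C, H, W = X.shape
    x = X.transpose(1, 2, 0)                    # (H, W, C)
    Q = x @ Wq.T                                # (H, W, C'): 1x1 conv == per-pixel linear map
    K = x @ Wk.T                                # (H, W, C')
    V = x @ Wv.T                                # (H, W, C)
    scores = np.einsum('hic,hjc->hij', Q, K)    # correlations within each row
    Y = softmax(scores, axis=-1)                # attention map, one W×W block per row
    Yp = np.einsum('hij,hjc->hic', Y, V)        # aggregate value features along the row
    A = alpha * Yp + x                          # residual sum with the input feature
    return A.transpose(2, 0, 1)                 # back to (C, H, W)

C, H, W = 8, 4, 6
rng = np.random.default_rng(0)
X = rng.standard_normal((C, H, W))
Cp = C // 2
Wq = rng.standard_normal((Cp, C))
Wk = Wq.copy()                                  # f_query and f_key share parameters
Wv = rng.standard_normal((C, C))
A = horizontal_attention(X, Wq, Wk, Wv, alpha=0.1)  # alpha is learnable in practice
```

In a trained network Wq, Wk, Wv and alpha are learned; here they are random only to exercise the shapes.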
Although the attention mechanism improves the object boundaries in the disparity map to some extent, the disparity is still not accurate enough: as shown in fig. 3, the boundaries produced with the horizontal attention mechanism (third row of fig. 3) are clearer than those produced without it (second row of fig. 3). A deformable convolution is therefore applied to the horizontal-attention features to further improve the object boundaries in the disparity map. The right part of FIG. 2 shows the deformable convolution module that processes the feature A output by the horizontal attention mechanism; by learning spatial convolution offsets, the module adaptively samples feature information at positions whose depth is similar to that of the center pixel. The deformable convolution module in the method consists of a 1×1 two-dimensional convolution, a 3×3 deformable convolution and another 1×1 two-dimensional convolution, which can be expressed as
A'=conv2d(deformconv2d(conv2d(A)))
where conv2d denotes a 1×1 two-dimensional convolution and deformconv2d denotes a 3×3 deformable convolution; A′ is the feature after processing by the horizontal deformable attention module. The fourth row of fig. 3 shows the effect of the horizontal deformable attention module: the accuracy of object-boundary disparity improves further compared with using the horizontal attention mechanism alone.
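The key idea of the deformable convolution — sampling each kernel tap at a learned fractional offset via bilinear interpolation — can be illustrated with a minimal single-channel NumPy sketch. Real deformable convolution (e.g. torchvision's DeformConv2d) predicts a separate offset per output location; here one offset per tap is used for brevity, so this is a simplification, not the patent's module:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinear sample of img (H, W) at fractional coords, zero outside."""
    H, W = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    out = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < H and 0 <= xx < W:
                out += wy * wx * img[yy, xx]
    return out

def deform_conv_3x3(x, weight, offsets):
    """Single-channel 3x3 deformable convolution (stride 1, zero padding).
    offsets[ky, kx] = (dy, dx) is added to each kernel tap's sampling
    position, letting the kernel gather context where it is most useful."""
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    dy, dx = offsets[ky, kx]
                    acc += weight[ky, kx] * bilinear(x, i + ky - 1 + dy, j + kx - 1 + dx)
            out[i, j] = acc
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.zeros((3, 3)); w[1, 1] = 1.0                  # identity kernel
zero = np.zeros((3, 3, 2))
plain = deform_conv_3x3(x, w, zero)                  # zero offsets -> ordinary conv
shift = zero.copy(); shift[1, 1] = (0.0, 1.0)        # center tap samples one pixel right
out = deform_conv_3x3(x, w, shift)
```

With zero offsets the module degenerates to a plain convolution; the learned offsets are what let it adapt its sampling pattern near disparity discontinuities.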
Both the horizontal deformable attention module itself and any horizontal deformable attention module formed by combining a horizontal attention mechanism with a deformable convolution module fall within the protection scope of the invention.
Example 1:
An ablation experiment demonstrates that both the horizontal attention mechanism and the deformable convolution improve the quality of the generated disparity map. To verify the effectiveness of the method, a comparison test was carried out in which the horizontal attention mechanism and the deformable convolution were added to the feature extraction part of a deep-learning binocular stereo matching network, the rest of the network structure being unchanged. Four configurations were tested: neither horizontal attention nor deformable convolution; deformable convolution only; horizontal attention only; and horizontal attention with deformable convolution. The results are shown in table 1. The GPU used was a GeForce RTX 2080 Ti, the input image resolution was 256×512, and the test was performed on the SceneFlow dataset.
Table 1 experimental comparative results
Compared with the prior art, the proposed method constructs a horizontal deformable attention module in which the horizontal attention mechanism introduces the epipolar constraint, so that the similarity of pixel features between the left and right views in the horizontal direction is learned better and the matching of the same pixel across the two views is improved; meanwhile, the deformable convolution makes better use of context feature information to refine the textures of the disparity map, thereby improving its quality.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A binocular stereo matching method based on a horizontal deformable attention module, the method comprising:
s1, inputting the characteristics of a left view extracted by a first ResNet backbone network into a left view of a binocular stereo matching method, and inputting the characteristics of a right view extracted by a second ResNet backbone network into a right view of the binocular stereo matching method;
s2, inputting the characteristics output by the first ResNet backbone network into a first horizontal deformable attention module for further characteristic processing, and inputting the characteristics output by the second ResNet backbone network into a second horizontal deformable attention module for further characteristic processing;
s3, the characteristics processed by the first horizontal deformable attention module and the second horizontal deformable attention module form a matching cost body through characteristic cascade, and then a final parallax image is obtained through three-dimensional convolution parallax regression;
the first horizontal deformable attention module and the second horizontal deformable attention module are the same horizontal deformable attention module, and both the first horizontal deformable attention module and the second horizontal deformable attention module comprise a horizontal attention mechanism and a deformation convolution module.
2. The binocular stereo matching method based on the horizontal deformable attention module of claim 1, wherein the first and second ResNet backbone networks are the same ResNet backbone network.
3. The binocular stereo matching method based on the horizontal deformable attention module of claim 1, wherein parameter sharing is performed between the first ResNet backbone network and the second ResNet backbone network.
4. The binocular stereo matching method based on horizontal deformable attention modules of claim 1, wherein parameter sharing is performed between the first horizontal deformable attention module and the second horizontal deformable attention module.
5. A binocular stereo matching method based on a horizontal deformable attention module according to any of claims 1-4, wherein the horizontal attention mechanism comprises in particular:
representing the input unary feature as X ∈ R^(C×H×W), wherein C, H and W are the number of channels, the spatial height and the spatial width, respectively;
first, convolving X with three convolution layers f_query, f_key and f_value, each with a 1×1 kernel, to obtain Q ∈ R^(C′×H×W), K ∈ R^(C′×H×W) and V ∈ R^(C×H×W), wherein C′ = C/2;
then reshaping the three outputs: Q is flattened and transposed to obtain Q̄ ∈ R^(N×C′), wherein N = H×W, and K and V are reshaped to obtain K̄ ∈ R^(H×C′×W) and V̄ ∈ R^(H×C×W);
computing, within each image row, the correlation matrix between Q̄ and K̄ and applying softmax to obtain the attention map Y ∈ R^(N×W), expressed as:
y_(J,I) = exp(Q̄_I · K̄_J) / Σ_(J′) exp(Q̄_I · K̄_(J′))
wherein y_(J,I) represents the normalized correlation between Q̄ at pixel I and K̄ at position J on the same row; the more similar two features are in stereo matching, the more reliable the disparity between the two spatial points, so aggregating similar depth features achieves mutual gain;
performing a correlation matrix operation between the attention map and the value feature V̄, reshaping the result to Y′ ∈ R^(C×H×W), and finally summing the context feature Y′ with the input unary feature X to obtain the final output A ∈ R^(C×H×W) of the feature processing stage, expressed as:
A=sum(αY',X)
where α is a learnable parameter used to adjust the operational weight of the attention mechanism.
6. The binocular stereo matching method of claim 5, wherein the convolution kernel parameters of f_query and f_key are shared and the number of channels of the output features is reduced from 320 to 128.
7. The binocular stereo matching method of claim 5, wherein the input unary feature X is a feature extracted through the ResNet backbone network, which has a reliable global receptive field and provides sufficient global context information.
8. The binocular stereo matching method of claim 5, wherein the horizontal attention mechanism attends only to local features of pixels on the same row as the center pixel, exploiting the epipolar constraint to capture more feature pairs that are valid for matching.
9. The binocular stereo matching method based on the horizontal deformable attention module of claim 5, wherein the deformable convolution module comprises a 1×1 two-dimensional convolution, a 3×3 deformable convolution and a 1×1 two-dimensional convolution, expressed as,
A'=conv2d(deformconv2d(conv2d(A)))
where conv2d represents a 1 x 1 two-dimensional convolution and deformconv2d represents a 3 x 3 deformable convolution, a' being the feature after processing by the horizontal deformable attention module.
10. The binocular stereo matching method based on the horizontal deformable attention module of claim 9, wherein the feature A output by the horizontal attention mechanism is processed by the deformable convolution module, which adaptively samples feature information at positions with depth similar to the center pixel by learning spatial convolution offsets.
CN202310614417.6A 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module Pending CN116777971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614417.6A CN116777971A (en) 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310614417.6A CN116777971A (en) 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module

Publications (1)

Publication Number Publication Date
CN116777971A true CN116777971A (en) 2023-09-19

Family

ID=87987045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614417.6A Pending CN116777971A (en) 2023-05-29 2023-05-29 Binocular stereo matching method based on horizontal deformable attention module

Country Status (1)

Country Link
CN (1) CN116777971A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058252A (en) * 2023-10-12 2023-11-14 东莞市爱培科技术有限公司 Self-adaptive fusion stereo matching algorithm
CN117058252B (en) * 2023-10-12 2023-12-26 东莞市爱培科技术有限公司 Self-adaptive fusion stereo matching method

Similar Documents

Publication Publication Date Title
CN110084757B (en) Infrared depth image enhancement method based on generation countermeasure network
CN111259945B (en) Binocular parallax estimation method introducing attention map
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
Xia et al. Identifying recurring patterns with deep neural networks for natural image denoising
US11170202B2 (en) Apparatus and method for performing 3D estimation based on locally determined 3D information hypotheses
Yue et al. CID: Combined image denoising in spatial and frequency domains using Web images
CN111275643A (en) True noise blind denoising network model and method based on channel and space attention
CN103996201A (en) Stereo matching method based on improved gradient and adaptive window
Jiang et al. Learning a referenceless stereopair quality engine with deep nonnegativity constrained sparse autoencoder
CN116777971A (en) Binocular stereo matching method based on horizontal deformable attention module
CN115239870A (en) Multi-view stereo network three-dimensional reconstruction method based on attention cost body pyramid
CN116912405A (en) Three-dimensional reconstruction method and system based on improved MVSNet
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
CN113538569A (en) Weak texture object pose estimation method and system
Liu et al. Temporal consistency learning of inter-frames for video super-resolution
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN111553296A (en) Two-value neural network stereo vision matching method based on FPGA
CN111582437A (en) Construction method of parallax regression deep neural network
CN214587004U (en) Stereo matching acceleration circuit, image processor and three-dimensional imaging electronic equipment
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
Li et al. Graph-based saliency fusion with superpixel-level belief propagation for 3D fixation prediction
Liu et al. Dnt: Learning unsupervised denoising transformer from single noisy image
CN112634128B (en) Stereo image redirection method based on deep learning
Rezayi et al. Huber Markov random field for joint super resolution
Bae et al. Efficient and scalable view generation from a single image using fully convolutional networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination