CN115049739A - Binocular vision stereo matching method based on edge detection - Google Patents

Binocular vision stereo matching method based on edge detection

Info

Publication number
CN115049739A
CN115049739A CN202210670652.0A
Authority
CN
China
Prior art keywords
res
features
feature
pyramid
volume
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210670652.0A
Other languages
Chinese (zh)
Inventor
杨文帮 (Yang Wenbang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202210670652.0A priority Critical patent/CN115049739A/en
Publication of CN115049739A publication Critical patent/CN115049739A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a binocular vision stereo matching method based on edge detection, which comprises the following components: feature extraction, a correlation pyramid, and a GRU update module. When existing stereo matching algorithms are used for stereo matching of images, existing work usually depends on a 3D convolutional network to process the stereo cost volume; by using RAFT-Stereo as the overall framework, only a lightweight cost volume constructed by 2D convolution and a single matrix multiplication is needed. In order to address textureless regions and boundaries, multi-task semantic information is encoded through a semantic pyramid: following EdgeStereo, the context encoder in RAFT-Stereo is replaced by a RINDNet-based encoder for feature extraction, which gives the method strong boundary perception capability. Because the network is iterative, accuracy can easily be traded for efficiency by stopping early. Multi-level GRU units maintain hidden states at multiple resolutions, with cross-connections, but still generate a single high-resolution disparity update.

Description

Binocular vision stereo matching method based on edge detection
Technical Field
The invention relates to the field of computer binocular stereo vision, in particular to a binocular vision stereo matching method based on edge detection.
Background
It is known that light rays in a scene are collected by the human binocular imaging system and transmitted through the nerve centre to the brain, where hundreds of millions of neurons process them in parallel to obtain real-time, high-definition and accurate depth perception information.
Binocular stereo vision is an important form of computer vision that obtains three-dimensional information of a scene by simulating the characteristics of binocular vision. A binocular camera acquires scene information from two different viewpoints, and the distance from each corresponding point to the imaging plane is calculated from the disparity between the two views, yielding depth perception and three-dimensional reconstruction. The binocular stereo matching algorithm itself remains a difficult technical problem to be solved.
Disclosure of Invention
The invention aims to provide a binocular vision stereo matching method based on edge detection so as to solve the technical problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions.
A binocular vision stereo matching method based on edge detection comprises the following steps:
step one: following the EdgeStereo context-integrated residual stereo matching pyramid network, replacing the context encoder of RAFT-Stereo with a RINDNet-based encoder;
step two: performing feature extraction on the image by using a feature encoder and an RINDNet-based encoder;
step three: using the dot product between feature vectors as a measure of visual similarity, and restricting the computation of the correlation volume to pixels sharing the same y-coordinate. Given feature maps f, g ∈ R^{H×W×D} extracted from I_L and I_R respectively, the inner products are computed only between feature vectors that share the same first index. A 4-level correlation pyramid is then constructed by repeatedly average-pooling the last dimension;
step four: GRU update step. Starting from an initial disparity field d_0 = 0, a series of disparity fields {d_1, …, d_N} is predicted. In each iteration, the correlation volume is indexed with the current disparity estimate, producing a set of correlation features; these features pass through 2 convolutional layers. Similarly, the current disparity estimate also passes through 2 convolutional layers. The correlation, disparity, and RINDNet-extracted features are then concatenated and injected into the GRU. The GRU updates the hidden state, and the new hidden state is used to predict the disparity update.
The binocular vision stereo matching method based on edge detection provided by the invention has the following technical advantages: multi-task semantic information is encoded through a semantic pyramid, and following EdgeStereo, the context encoder in RAFT-Stereo is replaced by a RINDNet-based encoder for feature extraction, giving the method strong boundary perception capability. Because the network is iterative, accuracy can easily be traded for efficiency by stopping early. Multi-level GRU units maintain hidden states at multiple resolutions, with cross-connections, but still generate a single high-resolution disparity update.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
FIG. 2 is a flowchart of a binocular vision stereo matching method according to an embodiment;
FIG. 3 is a one-dimensional grid graph with integer offsets;
FIG. 4 is a cross-connect diagram of GRU inputs;
FIG. 5 is a schematic diagram of (a) the weight layer, (b) the decoder, and (c) the attention module of RINDNet.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings, wherein like elements in different embodiments are numbered with like associated numerals. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted from the various embodiments, or may be replaced with other elements, materials, or methods. In some cases, operations related to the present application are not shown or described in the specification in order to avoid obscuring its core content with excessive description; a detailed description of these related operations is not necessary, as those skilled in the art can fully understand them from the specification and the general technical knowledge of the field.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the description of the method may be transposed or reordered in ways that will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required order unless otherwise indicated where such an order must be followed.
As shown in fig. 1 to 5, the present invention provides a binocular vision stereo matching method based on edge detection, the method comprising the steps of:
s1: acquiring images under two viewpoints;
s2: the left image and the right image are each sent to the feature extractor to extract dense feature maps;
s3: constructing a correlation cost volume from the dot products of the feature maps extracted from the left and right images;
s4: constructing a correlation pyramid from the correlation cost volumes of the two images;
s5: synchronously updating the feature maps at the corresponding resolutions through a multi-level update module.
Specifically, in the embodiment of the present invention, given a pair of rectified images (I_L, I_R), our goal is to estimate the disparity field d, which gives the horizontal displacement of each pixel in I_L. Similar to RAFT-Stereo, our method consists of three main components: a feature extractor, a correlation pyramid, and a GRU-based update operator, as shown in fig. 1. The update operator iteratively retrieves features from the correlation pyramid and performs updates on the disparity field.
The number of scales S in the residual pyramid is consistent with the structure of the encoder. The smallest scale of the residual pyramid yields the coarsest disparity map d_S, which is then successively upsampled and refined with the residual map r_s at each larger scale until the full-size disparity map d_0 is obtained. The formula is shown in equation (1), where u(·) denotes upsampling by a factor of 2 and s denotes the pyramid scale (e.g., 0 denotes full resolution):

d_s = u(d_{s+1}) + r_s, 0 ≤ s < S    (1)
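As an illustration of equation (1), a minimal PyTorch-style sketch of the coarse-to-fine refinement is given below; the residual maps r_s and the coarsest disparity d_S are assumed to come from the residual pyramid network, which is not shown here.

```python
import torch.nn.functional as F

def coarse_to_fine_disparity(d_S, residuals):
    """Sketch of Eq. (1): d_s = u(d_{s+1}) + r_s, applied from scale S-1 down to 0.

    d_S:       coarsest disparity map, shape (B, 1, H/2^S, W/2^S)
    residuals: list [r_{S-1}, ..., r_0] of residual maps at successively finer scales
    """
    d = d_S
    for r_s in residuals:
        # u(.): upsampling by a factor of 2 (disparity magnitudes may also need
        # rescaling by 2 when expressed in pixels; omitted here for clarity)
        d = F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=True)
        d = d + r_s                      # refine with the residual at this scale
    return d                             # full-size disparity map d_0
```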
Finally, we use the edge map as regularization in an edge-aware smoothness loss, which provides a valid guide for disparity estimation. The disparity smoothness loss C_ds encourages the disparity to be locally smooth, penalizing disparity variations in non-edge regions. To account for depth discontinuities at object contours, previous methods weight the regularization term according to the image gradients. Instead, we weight the term according to the gradient of the edge probability map, which is semantically more meaningful than intensity variations. As shown in equation (2), N denotes the number of pixels, ∂d denotes the disparity gradient and ∂E denotes the gradient of the edge probability map:

C_ds = (1/N) Σ_i ( |∂_x d_i| e^{-|∂_x E_i|} + |∂_y d_i| e^{-|∂_y E_i|} )    (2)
In the second phase, we supervise the regressed disparity at the S scales on the stereo dataset. With deep supervision, the total loss is the sum of the losses over all scales, C = Σ_s C_s, where C_s denotes the loss at scale s. In addition to the disparity smoothness loss, we use a disparity regression loss C_r for supervised learning, as shown in equation (3):

C_r = (1/N) Σ_i smooth_L1( d_i − d̂_i )    (3)

where d̂ denotes the ground-truth disparity map. Thus, the total loss at scale s becomes

C_s = C_r + λ_ds · C_ds,

where λ_ds is the loss weight of the smoothness term. In addition, the weights of the edge sub-network are fixed.
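A minimal PyTorch-style sketch of these losses follows; the smooth-L1 form of the regression loss and the value of λ_ds are assumptions for illustration, and disp, disp_gt and edge_prob are assumed to be (B, 1, H, W) tensors.

```python
import torch
import torch.nn.functional as F

def edge_aware_smoothness_loss(disp, edge_prob):
    """Sketch of Eq. (2): penalize disparity gradients, down-weighted where the
    edge probability map itself has strong gradients (likely depth boundaries)."""
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_e = (edge_prob[..., :, 1:] - edge_prob[..., :, :-1]).abs()
    dy_e = (edge_prob[..., 1:, :] - edge_prob[..., :-1, :]).abs()
    return (dx_d * torch.exp(-dx_e)).mean() + (dy_d * torch.exp(-dy_e)).mean()

def total_loss_at_scale(disp, disp_gt, edge_prob, lambda_ds=0.1):
    """Sketch of Eq. (3) plus the per-scale total loss C_s = C_r + lambda_ds * C_ds."""
    regression = F.smooth_l1_loss(disp, disp_gt)          # Eq. (3), smooth-L1 assumed
    return regression + lambda_ds * edge_aware_smoothness_loss(disp, edge_prob)
```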
We use two independent feature extractors, called the feature encoder and the RINDNet-based encoder. The feature encoder is applied to the left and right images and maps each image to a dense feature map, which is then used to construct the correlation volume.
The network consists of a series of residual blocks and downsampling layers and generates a 256-channel feature map at 1/4 or 1/8 of the input image resolution, depending on the number of downsampling layers used in our experiments. We use instance normalization in the feature encoder. The RINDNet-based encoder, by contrast, is an end-to-end network. First stage: common features of all edges are extracted. We first extract features common to all edge types using the backbone, since these edges have similar patterns in the intensity variations of the image. The backbone follows the structure of ResNet-50, which consists of five repeated building blocks. Specifically, the feature maps of the five blocks of ResNet-50 are denoted res_1, res_2, res_3, res_4 and res_5. Spatial cues are then generated from these features.
It is well known that different layers of CNN features encode different levels of appearance/semantic information and contribute differently to different edge types. Specifically, the lower-level feature maps res_{1-3} focus more on low-level cues (e.g., color, texture, and brightness), while the top-level maps res_{4-5} carry object-awareness information. It is therefore beneficial to capture multi-level spatial responses from the different feature map layers. Given the feature maps res_{1-5}, we obtain spatial response maps s_1, …, s_5, where each spatial response s_j is learned by a spatial layer composed of a convolutional layer and a deconvolution layer.
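By way of illustration, a minimal PyTorch-style sketch of one such spatial layer is given below; the channel counts, kernel sizes and upsampling factor are assumptions, not values specified by the invention.

```python
import torch.nn as nn

class SpatialLayer(nn.Module):
    """Sketch of a spatial layer: a convolution followed by a deconvolution that
    maps a backbone feature map res_j to a single-channel spatial response at a
    common resolution. All hyper-parameters here are illustrative."""
    def __init__(self, in_channels, up_factor=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose2d(1, 1, kernel_size=2 * up_factor,
                                         stride=up_factor, padding=up_factor // 2)

    def forward(self, res_j):
        return self.deconv(self.conv(res_j))   # spatial response map s_j
```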
Second stage: unique features are prepared for the REs/IEs and NEs/DEs.
RINDNet then learns the specific features of each edge type separately through the corresponding decoder in stage II.
We designed a decoder with two streams to recover fine position information, as shown in fig. 5. In the proposed architecture, the two stream decoders can work together and learn more powerful features from different perspectives.
Although the four decoders share the same structure, some special designs are proposed for the different edge types, described in detail below. To properly distinguish each edge type and better describe our work, we group the four edge types into two groups, namely REs/IEs and NEs/DEs, and prepare features for each group separately.
REs and IEs. In practice, the low-level features (e.g., res_{1-3}) capture the detailed intensity changes that are often reflected in REs and IEs. In addition, REs and IEs are related to the global context provided by the high-level features (e.g., res_5) and to surrounding objects. It is therefore desirable that semantic cues give proper guidance to the perceived intensity changes before they are forwarded to the decoder. Furthermore, it is worth noting that simply concatenating low-level and high-level features may be computationally too expensive due to the increased number of parameters. Therefore, we propose a Weight Layer (WL) that adaptively fuses low-level features and high-level cues in a learnable manner without increasing the feature dimensionality.
As shown in fig. 5, the WL contains two paths: the first path receives the high-level feature res_5 through a deconvolution layer to restore high resolution, and then two 3 × 3 convolutional layers with Batch Normalization (BN) and ReLU mine adaptive semantic cues; the other path is implemented as two convolutional layers with BN and ReLU, which encode the low-level features res_{1-3}. The two paths are then fused by element-wise multiplication. Formally, given the low-level features res_{1-3} and the high-level cue res_5, we generate fusion features for REs and IEs respectively:

g_r = WL_r([res_1, res_2, up(res_3)], res_5), g_i = WL_i([res_1, res_2, up(res_3)], res_5),

where the WLs of the REs and IEs are denoted WL_r and WL_i respectively, g_r / g_i are the fusion features of the REs/IEs, and [·] denotes concatenation. Note that the resolution of res_3 is lower than that of res_1 and res_2, so an upsampling operation up(·) is applied to res_3 before feature concatenation. Next, the fused features are fed into the respective decoders to generate specific features with accurate position information for the REs and IEs, respectively:
f_r = Dec_r(g_r), f_i = Dec_i(g_i),

where Dec_r and Dec_i denote the decoders of the REs and IEs respectively, and f_r / f_i are the decoded feature maps of the REs/IEs.
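A minimal PyTorch-style sketch of the weight layer is given below; the channel sizes and the deconvolution stride are assumptions chosen only so that the two paths have matching shapes.

```python
import torch
import torch.nn as nn

class WeightLayer(nn.Module):
    """Sketch of WL: one path decodes the high-level feature res_5 into semantic
    cues (deconv + two 3x3 conv/BN/ReLU), the other encodes the concatenated
    low-level features [res_1, res_2, up(res_3)] (two conv/BN/ReLU), and the two
    paths are fused by element-wise multiplication."""
    def __init__(self, low_channels, high_channels, out_channels, up_factor=8):
        super().__init__()
        self.high_path = nn.Sequential(
            nn.ConvTranspose2d(high_channels, out_channels,
                               kernel_size=2 * up_factor, stride=up_factor,
                               padding=up_factor // 2),          # restore resolution
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )
        self.low_path = nn.Sequential(
            nn.Conv2d(low_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )

    def forward(self, low_feats, res5):
        # low_feats: [res_1, res_2, up(res_3)] concatenated along the channel axis
        return self.low_path(low_feats) * self.high_path(res5)  # element-wise fusion
```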
NEs and DEs. Since the high-level features (e.g., res_5) express strong semantic responses, which are usually concentrated in NEs and DEs, we obtain the specific features of NEs and DEs from res_5:

f_n = Dec_n(res_5), f_d = Dec_d(res_5),

where the NE decoder and DE decoder are denoted Dec_n and Dec_d respectively, and f_n / f_d are the decoded features of the NEs/DEs. Since DEs and NEs typically share some relevant geometric cues, we share the weights of the second streams of the NE and DE decoders to learn the collaborative geometric cues. Meanwhile, the first streams of the NE decoder and the DE decoder are responsible for learning the specific features of NEs and DEs, respectively.
The initial predictions are then produced by decision heads:

O_r = H_r(f_r), O_i = H_i(f_i),

where O_r / O_i are the initial predictions of the REs/IEs, and the decision heads of the REs and IEs, denoted H_r and H_i respectively, are modeled as a 3 × 3 convolutional layer followed by a 1 × 1 convolutional layer. Note that the REs and IEs do not directly rely on the location cues provided by the top level, so the spatial cues are not used for them. In contrast, all of the spatial cues s_{1-5} are concatenated with the decoded features to generate the initial results for the NEs and DEs:

O_n = H_n([f_n, s_1, …, s_5]), O_d = H_d([f_d, s_1, …, s_5]),

where H_n and H_d, the decision heads of the NEs and DEs respectively, consist of three 1 × 1 convolutional layers that integrate the cues at each location. In short, O = {O_r, O_i, O_n, O_d} denotes the initial result set.
Attention module. Finally, RINDNet integrates the initial results with the attention maps obtained by the Attention Module (AM) to generate the final results. Since different types of edges appear at different locations, it is necessary to pay more attention to the relevant locations when predicting each edge type. Fortunately, the edge annotation provides a label for each location. Thus, the proposed AM can infer the spatial relationships between the multiple labels under pixel-level supervision through an attention mechanism, and the attention maps can be used to activate the responses of the relevant locations. Formally, given the input image X, the AM learns spatial attention

A = {A_b, A_r, A_i, A_n, A_d} = soft(ψ_att(X)),    (10)

where A is the attention normalized by the softmax function, and A_b, A_r, A_i, A_n, A_d ∈ [0,1]^{W×H} are the attention maps corresponding to the background, REs, IEs, NEs and DEs, respectively. Obviously, if a pixel carries a label of a given type, the position of that pixel should be assigned a higher attention value. The AM ψ_att is implemented by the first building block of ResNet, four 3 × 3 convolutional layers (each followed by ReLU and BN operations), and one 1 × 1 convolutional layer, as shown in fig. 5. Finally, the initial results are integrated with the attention maps to generate the final result y:

y = sigmoid(O ⊙ A_{r,i,n,d}),    (11)

where ⊙ denotes element-wise multiplication.
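The attention-based fusion of equations (10) and (11) can be sketched as follows; the tensor layouts (channel ordering of the attention maps and initial results) are assumptions for illustration.

```python
import torch

def fuse_with_attention(initial_results, attention_logits):
    """Sketch of Eqs. (10)-(11): softmax over the per-type attention maps, drop the
    background map, gate the initial predictions element-wise, apply a sigmoid.

    initial_results:  (B, 4, H, W) stacking O_r, O_i, O_n, O_d
    attention_logits: (B, 5, H, W) raw outputs of psi_att for background/REs/IEs/NEs/DEs
    """
    A = torch.softmax(attention_logits, dim=1)       # Eq. (10): normalized attention
    A_edges = A[:, 1:]                               # keep A_r, A_i, A_n, A_d
    return torch.sigmoid(initial_results * A_edges)  # Eq. (11): y = sigmoid(O * A)
```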
The unique features that optimize the detection of the four edge types are learned as described above. RINDNet performs initial-result inference in three stages: extracting common features, preparing distinguishing features, and generating the initial results, followed by the final predictions integrated by the attention module.
Correlation volume, correlation pyramid, and correlation lookup.
Correlation volume: we use the dot product between feature vectors as a measure of visual similarity. Whereas the 4D correlation volume of RAFT is built by computing visual similarities between all pairs of pixels, we limit the computation of the correlation volume to pixels sharing the same y-coordinate. Given feature maps f, g ∈ R^{H×W×D} extracted from I_L and I_R respectively, the 3D correlation volume can be computed with a modification of the 4D volume construction by restricting the inner products to feature vectors that share the same first index:

C_{ijk} = Σ_h f_{ijh} · g_{ikh},  C ∈ R^{H×W×W}.
as with 4D volumes, the computation of 3D volumes can be efficiently achieved using a single matrix multiplication, which can be easily computed on the GPU and takes only a small fraction of the total run time.
In rectified stereo, we can generally assume that all disparities are positive; thus, in practice the correlation volume would only need to be computed for positive disparities. However, an advantage of computing the entire volume is that the operation can be implemented with highly optimized matrix multiplication. This simplifies the overall architecture, allowing us to use general-purpose operations without a custom GPU kernel.
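A minimal PyTorch-style sketch of this 3D correlation volume, computed as a single batched matrix multiplication via einsum, is shown below; the (H, W, D) feature layout is an assumption.

```python
import torch

def correlation_volume_3d(f_left, f_right):
    """Sketch: C[b, i, j, k] = sum_h f_left[b, i, j, h] * f_right[b, i, k, h].
    Inner products are restricted to feature vectors on the same image row,
    giving a volume of shape (B, H, W, W)."""
    # f_left, f_right: (B, H, W, D) feature maps from I_L and I_R
    return torch.einsum("bijh,bikh->bijk", f_left, f_right)
```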
Correlation pyramid: we construct a 4-level pyramid of correlation volumes by repeatedly average-pooling the last dimension. The (k+1)-th level of the pyramid is built from the volume at level k using a 1D average pool with kernel size 2 and stride 2, producing a new volume C_{k+1} of dimension H × W × W/2^k. Each level of the pyramid has an increased receptive field, but by pooling only the last dimension we preserve the high-resolution information present in the original image, which allows us to recover very fine structures.
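A sketch of the pyramid construction, under the same layout assumption as above, might look like this:

```python
import torch.nn.functional as F

def correlation_pyramid(corr, num_levels=4):
    """Sketch: repeatedly 1D-average-pool the last (disparity) dimension of the
    correlation volume with kernel size 2 and stride 2."""
    B, H, W1, W2 = corr.shape
    pyramid = [corr]
    vol = corr.reshape(B * H * W1, 1, W2)            # treat last dim as a 1D signal
    for _ in range(num_levels - 1):
        vol = F.avg_pool1d(vol, kernel_size=2, stride=2)
        pyramid.append(vol.reshape(B, H, W1, -1))
    return pyramid
```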
Correlation lookup: to index into the correlation pyramid, we define a lookup operator L_C similar to the one defined in RAFT. Given a current estimate of the disparity d, we construct a one-dimensional grid with integer offsets around the current disparity estimate, as shown in fig. 3. The grid is used to index into each level of the correlation pyramid. Since the grid values are real numbers, we use linear interpolation when indexing each volume. The retrieved values are then concatenated into a single feature map.
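A sketch of such a lookup operator is given below; the sign convention (the matching column being j − d) and the lookup radius are assumptions for illustration.

```python
import torch

def lookup_correlation(pyramid, disparity, radius=4):
    """Sketch of the lookup operator L_C: sample every pyramid level on a 1D grid of
    integer offsets around the current disparity estimate, with linear interpolation,
    and concatenate the samples into a single feature map.

    pyramid:   list of volumes of shape (B, H, W, W / 2^l)
    disparity: current estimate, shape (B, H, W)
    """
    B, H, W = disparity.shape
    offsets = torch.arange(-radius, radius + 1, device=disparity.device).float()
    grid_j = torch.arange(W, device=disparity.device).float().view(1, 1, W, 1)
    samples = []
    for level, corr in enumerate(pyramid):
        W_l = corr.shape[-1]
        # column of the candidate match in the right image at this pyramid level
        coords = (grid_j - disparity.unsqueeze(-1)) / (2 ** level) + offsets
        coords = coords.clamp(0, W_l - 1)
        x0 = coords.floor()
        x1 = (x0 + 1).clamp(max=W_l - 1)
        w = coords - x0                                        # interpolation weight
        c0 = torch.gather(corr, -1, x0.long())
        c1 = torch.gather(corr, -1, x1.long())
        samples.append(c0 * (1 - w) + c1 * w)                  # linear interpolation
    return torch.cat(samples, dim=-1)   # (B, H, W, num_levels * (2*radius + 1))
```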
The multi-level update operator.
Starting from an initial disparity field d_0 = 0, we predict a series of disparity fields {d_1, …, d_N}. In each iteration, we index the correlation volume using the current estimate of disparity, producing a set of correlation features. These features are passed through 2 convolutional layers. Similarly, the current disparity estimate also passes through 2 convolutional layers. The correlation, disparity, and RINDNet-extracted features are then concatenated and injected into the GRU. The GRU updates the hidden state, and the new hidden state is then used to predict the disparity update.
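A minimal sketch of this update loop is shown below; the module names (corr_encoder, disp_encoder, gru, disp_head) and the hidden-state initialization are assumptions, and lookup_correlation refers to the lookup sketch above.

```python
import torch

def run_update_operator(corr_pyramid, context_feats, gru, corr_encoder,
                        disp_encoder, disp_head, num_iters=12):
    """Sketch of the iterative GRU update: start from d_0 = 0, look up correlation
    features at the current disparity, encode correlation and disparity with two
    conv layers each, concatenate with the RINDNet context features, update the
    GRU hidden state, and predict a disparity update."""
    B, C, H, W = context_feats.shape
    disp = torch.zeros(B, 1, H, W, device=context_feats.device)   # d_0 = 0
    hidden = torch.tanh(context_feats)        # hidden-state initialization (assumption)
    predictions = []
    for _ in range(num_iters):
        corr = lookup_correlation(corr_pyramid, disp.squeeze(1))   # (B, H, W, C_corr)
        x = torch.cat([corr_encoder(corr.permute(0, 3, 1, 2)),     # correlation features
                       disp_encoder(disp),                         # disparity features
                       context_feats], dim=1)                      # RINDNet features
        hidden = gru(hidden, x)               # GRU updates the hidden state
        disp = disp + disp_head(hidden)       # new hidden state predicts the update
        predictions.append(disp)
    return predictions
```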
Multiple hidden states: the original RAFT performs updates entirely at a single, fixed high resolution. One problem with this approach is that the receptive field grows very slowly as the number of GRU updates increases, which is problematic for scenes with large textureless regions and little local information. We address this by proposing a multi-resolution update operator that operates simultaneously on feature maps at 1/4, 1/8 and 1/16 resolution. In our experiments, we show that the multi-resolution update operator yields better generalization performance.
The GRUs are cross-connected by using each other's hidden states as inputs, as shown in fig. 4. The correlation lookup and the final disparity update are performed by the GRU at the highest resolution. We also experimented with higher-resolution models, with GRU updates at 1/4, 1/8 and 1/16 of the input image resolution.
Upsampling: the predicted disparity field is at 1/4 or 1/8 of the input image resolution. To output a full-resolution disparity map, we use the same convex upsampling method as RAFT.
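For reference, convex upsampling (as introduced by RAFT) can be sketched as follows; the mask is the per-pixel weight tensor predicted by the network, and the factor-of-4 setting and the scaling of disparity values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def convex_upsample(disp, mask, factor=4):
    """Sketch of convex upsampling: each full-resolution disparity value is a learned
    convex combination of the 3x3 coarse neighbourhood around it.

    disp: (B, 1, H, W) coarse disparity
    mask: (B, 9 * factor * factor, H, W) weights predicted by the network
    """
    B, _, H, W = disp.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)                        # convex weights over 3x3 window
    up = F.unfold(factor * disp, kernel_size=3, padding=1)   # scale values with resolution
    up = up.view(B, 1, 9, 1, 1, H, W)
    up = torch.sum(mask * up, dim=2)                         # convex combination
    up = up.permute(0, 1, 4, 2, 5, 3)                        # (B, 1, H, factor, W, factor)
    return up.reshape(B, 1, factor * H, factor * W)
```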
The above embodiments are merely illustrative of a preferred embodiment, but not limiting. When the invention is implemented, appropriate replacement and/or modification can be carried out according to the requirements of users.
While embodiments of the invention have been disclosed above, it is not intended to be limited to the uses set forth in the specification and examples. It can be applied to all kinds of fields suitable for the present invention. Additional modifications will readily occur to those skilled in the art. It is therefore intended that the invention not be limited to the exact details and illustrations described and illustrated herein, but fall within the scope of the appended claims and equivalents thereof.

Claims (7)

1. An edge detection-based binocular vision stereo matching method is characterized by comprising the following steps:
given a pair of rectified images (I_L, I_R), the goal is to estimate the disparity field d, which gives the horizontal displacement of each pixel in I_L; similar to RAFT-Stereo, the method consists of three components: an edge feature extractor, a correlation pyramid, and a GRU-based update operator that iteratively retrieves features from the correlation pyramid and performs updates on the disparity field.
2. The binocular vision stereo matching method based on edge detection according to claim 1, wherein: extracting edge features:
encoding multi-task semantic information through a semantic pyramid, performing feature extraction following EdgeStereo, and guiding accurate disparity estimation with boundary cues; in the multi-task model, the semantic pyramid encodes context cues and the full-size disparity is inferred back using an hourglass structure;
EdgeStereo is a distinct multi-task structure used to perform multi-stage training;
using two independent feature extractors, namely a feature encoder and a RINDNet-based encoder; the feature encoder is applied to the left and right images and maps each image to a dense feature map, which is then used to construct the correlation volume; instance normalization is used in the feature encoder.
3. The binocular vision stereo matching method based on edge detection according to claim 2, wherein: application of the RINDNet-based encoder:
RINDNet is an end-to-end network;
first stage: extracting common features of all edges; the common features of all edges are extracted using a backbone, since the edges have similar patterns in the intensity variations of the image; the backbone follows the ResNet-50 structure, which consists of five repeated building blocks;
second stage: preparing unique features for the REs/IEs and NEs/DEs;
RINDNet learns the specific features of each edge type separately through the corresponding decoder in stage II; a decoder with two streams recovers fine position information, and the four edge types are divided into two groups, namely REs/IEs and NEs/DEs, for which features are prepared separately;
the low-level features capture the detailed intensity changes reflected in REs and IEs; REs and IEs are related to the global context provided by the high-level features and to surrounding objects; semantic cues are expected to give guidance on the perceived intensity changes before being forwarded to the decoder, and the low-level features and high-level cues are adaptively fused in a learnable manner by the weight layer WL.
4. The binocular vision stereo matching method based on edge detection according to claim 3, wherein: application of the weight layer WL:
the WL contains two paths: the first path receives the high-level feature res_5 through a deconvolution layer to restore high resolution, and then two 3 × 3 convolutional layers with Batch Normalization (BN) and ReLU mine adaptive semantic cues; the other path is implemented as two convolutional layers with BN and ReLU, which encode the low-level features res_{1-3}; the two paths are then fused by element-wise multiplication; formally, given the low-level features res_{1-3} and the high-level cue res_5, fusion features are generated for the REs and the IEs, respectively;
the resolution of res_3 is lower than that of res_1 and res_2, so an upsampling operation up(·) is applied to res_3 before feature concatenation; the fused features are fed into the respective decoders, generating specific features with accurate position information for the REs and IEs, respectively.
5. The binocular vision stereo matching method based on edge detection according to claim 4, wherein: the specific features of NEs and DEs are obtained using res_5;
DEs and NEs share geometric cues, and the weights of the second streams of the NE decoder and the DE decoder are shared to learn collaborative geometric cues, while the first streams of the NE decoder and the DE decoder are responsible for learning the specific features of NEs and DEs, respectively;
all of the spatial cues are concatenated with the decoded features to generate the initial results for NEs and DEs, respectively;
RINDNet integrates the initial results with the attention maps obtained by the attention module AM to generate the final results; the edge annotations provide a label for each location; the attention module AM can infer the spatial relationships between the multiple labels under pixel-level supervision through an attention mechanism, and the attention maps are used to activate the responses of the relevant locations; formally, given an input image X, the attention module AM learns spatial attention;
RINDNet performs initial-result inference in three stages: extracting common features, preparing distinguishing features, and generating the initial results, followed by the final predictions integrated by the attention module.
6. The binocular vision stereo matching method based on edge detection according to claim 1, wherein: correlation volume, correlation pyramid, and correlation lookup:
correlation volume: using the dot product between feature vectors as a measure of visual similarity;
unlike a 4D correlation volume constructed by computing the visual similarity between all pairs of pixels, the computation of the correlation volume is limited to pixels sharing the same y-coordinate; given feature maps f, g ∈ R^{H×W×D} extracted from I_L and I_R respectively, the 3D correlation volume is computed with a modification of the 4D volume construction by limiting the inner products to feature vectors sharing the same first index;
the computation of the 3D volume is efficiently implemented using a single matrix multiplication;
in rectified stereo, all disparities are assumed to be positive, and the correlation volume is computed for positive disparities;
correlation pyramid: a 4-level correlation pyramid is constructed by repeatedly average-pooling the last dimension; the (k+1)-th level of the pyramid is built from the volume at level k using a 1D average pool with kernel size 2 and stride 2, producing a new volume C_{k+1} of dimension H × W × W/2^k; each level of the pyramid has an increased receptive field, and the high-resolution information present in the original image is preserved by pooling only the last dimension;
correlation lookup: a lookup operator L_C is defined; given a current estimate of the disparity d, a one-dimensional grid with integer offsets is constructed around the current disparity estimate; linear interpolation is used when indexing each volume; the retrieved values are then concatenated into a single feature map.
7. The binocular vision stereo matching method based on edge detection according to claim 1, wherein: a multi-level update operator;
starting from an initial disparity field d_0 = 0, a series of disparity fields {d_1, ..., d_N} is predicted; in each iteration, the correlation volume is indexed using the current estimate of the disparity, generating a set of correlation features; these features pass through 2 convolutional layers; the current disparity estimate also passes through 2 convolutional layers; the correlation, disparity, and RINDNet-extracted features are then concatenated and injected into the GRU; the GRU updates the hidden state; the new hidden state is then used to predict the disparity update;
multiple hidden states: the original RAFT performs updates entirely at a fixed high resolution; a multi-resolution update operator runs simultaneously on feature maps at 1/4, 1/8 and 1/16 resolution;
the GRUs are cross-connected, using each other's hidden states as inputs; experiments are also performed with higher-resolution models, with GRU updates at 1/4, 1/8 and 1/16 of the input image resolution;
upsampling: the predicted disparity field is at 1/4 or 1/8 of the input image resolution; the full-resolution disparity map is output using the same convex upsampling method as RAFT.
CN202210670652.0A 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection Pending CN115049739A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210670652.0A CN115049739A (en) 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210670652.0A CN115049739A (en) 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection

Publications (1)

Publication Number Publication Date
CN115049739A true CN115049739A (en) 2022-09-13

Family

ID=83162125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210670652.0A Pending CN115049739A (en) 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection

Country Status (1)

Country Link
CN (1) CN115049739A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487741A (en) * 2021-06-01 2021-10-08 中国科学院自动化研究所 Dense three-dimensional map updating method and device
CN113487741B (en) * 2021-06-01 2024-05-28 中国科学院自动化研究所 Dense three-dimensional map updating method and device
CN116128946A (en) * 2022-12-09 2023-05-16 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism
CN116128946B (en) * 2022-12-09 2024-02-09 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Similar Documents

Publication Publication Date Title
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Cherabier et al. Learning priors for semantic 3d reconstruction
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
CN112926396A (en) Action identification method based on double-current convolution attention
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN111179187B (en) Single image rain removing method based on cyclic generation countermeasure network
CN110197505A (en) Remote sensing images binocular solid matching process based on depth network and semantic information
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112651360B (en) Skeleton action recognition method under small sample
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN115984494A (en) Deep learning-based three-dimensional terrain reconstruction method for lunar navigation image
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN112288772B (en) Channel attention target tracking method based on online multi-feature selection
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN113807356A (en) End-to-end low visibility image semantic segmentation method
Xu et al. AutoSegNet: An automated neural network for image segmentation
Chen et al. PDWN: Pyramid deformable warping network for video interpolation
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN115187638A (en) Unsupervised monocular depth estimation method based on optical flow mask
Zhou et al. Attention transfer network for nature image matting
Du et al. Srh-net: Stacked recurrent hourglass network for stereo matching
CN113888399B (en) Face age synthesis method based on style fusion and domain selection structure
Huang et al. ES-Net: An efficient stereo matching network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination