CN115049739A - Binocular vision stereo matching method based on edge detection - Google Patents

Binocular vision stereo matching method based on edge detection

Info

Publication number
CN115049739A
CN115049739A CN202210670652.0A
Authority
CN
China
Prior art keywords
res
features
feature
pyramid
volume
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210670652.0A
Other languages
Chinese (zh)
Inventor
杨文帮 (Yang Wenbang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202210670652.0A priority Critical patent/CN115049739A/en
Publication of CN115049739A publication Critical patent/CN115049739A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a binocular vision stereo matching method based on edge detection, which comprises the following components: feature extraction, a correlation pyramid, and a GRU update module. When existing stereo matching algorithms are used for stereo matching of images, existing work usually depends on a 3D convolutional network to process the stereo cost volume; by using RAFT-Stereo as the overall framework, only a lightweight cost volume constructed by 2D convolution and a single matrix multiplication is needed. In order to address textureless regions and boundaries, multi-task semantic information is encoded through a semantic pyramid: following EdgeStereo, the context encoder in RAFT-Stereo is replaced by a RINDNet-based encoder for feature extraction, which gives the method strong boundary perception capability. Because the network is iterative, accuracy can easily be traded for efficiency by stopping early. Multi-level GRU units maintain hidden states at multiple resolutions, with cross-connections, but still generate a single high-resolution disparity update.

Description

Binocular vision stereo matching method based on edge detection
Technical Field
The invention relates to the field of computer binocular stereo vision, in particular to a binocular vision stereo matching method based on edge detection.
Background
It is known that light rays in a scene are collected by the human binocular imaging system and transmitted through the nerve centre to the brain, where hundreds of millions of neurons process them in parallel to obtain real-time, high-definition and accurate depth perception information.
Binocular stereo vision is an important form of computer vision that obtains three-dimensional information of a scene by simulating the characteristics of binocular vision. A binocular camera acquires scene information from two different viewpoints, and the distance from each corresponding point to the imaging plane is calculated from the disparity between the two views, yielding depth perception and three-dimensional reconstruction. The binocular stereo matching algorithm itself remains a difficult technical problem to be solved.
Disclosure of Invention
The invention aims to provide a binocular vision stereo matching method based on edge detection so as to solve the technical problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions.
A binocular vision stereo matching method based on edge detection comprises the following steps:
step one: following the EdgeStereo context-integrated residual stereo matching pyramid network, replacing the context encoder of RAFT-Stereo with a RINDNet-based encoder;
step two: performing feature extraction on the image by using a feature encoder and an RINDNet-based encoder;
step three: using the dot product between feature vectors as a measure of visual similarity, and restricting the computation of the correlation volume to pixels sharing the same y-coordinate. Given feature maps f, g ∈ R^{H×W×D} extracted from I_L and I_R respectively, the inner products are computed only between feature vectors that share the same first index. A 4-level correlation pyramid is then constructed by repeatedly average-pooling the last dimension;
step four: GRU update step. Starting from an initial disparity field d_0 = 0, a series of disparity fields {d_1, …, d_N} is predicted. In each iteration, the correlation volume is indexed with the current disparity estimate, producing a set of correlation features; these features pass through 2 convolutional layers. Similarly, the current disparity estimate also passes through 2 convolutional layers. The correlation, disparity, and RINDNet-extracted features are then concatenated and injected into the GRU. The GRU updates the hidden state, and the new hidden state is used to predict the disparity update.
The binocular vision stereo matching method based on edge detection provided by the invention has the following technical advantages: multi-task semantic information is encoded through a semantic pyramid, and following EdgeStereo, the context encoder in RAFT-Stereo is replaced by a RINDNet-based encoder for feature extraction, giving the method strong boundary perception capability. Because the network is iterative, accuracy can easily be traded for efficiency by stopping early. Multi-level GRU units maintain hidden states at multiple resolutions, with cross-connections, but still generate a single high-resolution disparity update.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
FIG. 2 is a flowchart of a binocular vision stereo matching method according to an embodiment;
FIG. 3 is a one-dimensional grid graph with integer offsets;
FIG. 4 is a cross-connect diagram of GRU inputs;
FIG. 5 is a schematic diagram of (a) the weight layer, (b) the decoder, and (c) the attention module of RINDNet.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings, wherein like elements in different embodiments are numbered with like associated numerals. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted from the various embodiments, or may be replaced with other elements, materials, or methods. In some cases, operations related to the present application are not shown or described in the specification in order to avoid obscuring its core content with excessive description; a detailed description of these related operations is not necessary, as those skilled in the art can fully understand them from the specification and the general technical knowledge of the field.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the description of the method may be transposed or reordered in ways that will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required order unless otherwise indicated where such an order must be followed.
As shown in fig. 1 to 5, the present invention provides a binocular vision stereo matching method based on edge detection, the method comprising the steps of:
s1: acquiring images under two viewpoints;
s2: the left image and the right image are each sent to the feature extractor to extract dense feature maps;
s3: constructing a correlation cost volume from the dot products of the feature maps extracted from the left and right images;
s4: constructing a correlation pyramid from the correlation cost volumes of the two images;
s5: synchronously updating the feature maps at the corresponding resolutions through a multi-level update module.
Specifically, in the embodiment of the present invention, given a pair of rectified images (I_L, I_R), our goal is to estimate the disparity field d, which gives the horizontal displacement of each pixel in I_L. Similar to RAFT-Stereo, our method consists of three main components: a feature extractor, a correlation pyramid, and a GRU-based update operator, as shown in fig. 1. The update operator iteratively retrieves features from the correlation pyramid and performs updates on the disparity field.
The number of scales S in the residual pyramid is consistent with the structure of the encoder. The smallest scale of the residual pyramid yields the coarsest disparity map d_S, which is then successively upsampled and refined with the residual map r_s at each larger scale until the full-size disparity map d_0 is obtained. The formula is shown in equation (1), where u(·) denotes upsampling by a factor of 2 and s denotes the pyramid scale (e.g., 0 denotes full resolution):

d_s = u(d_{s+1}) + r_s, 0 ≤ s < S    (1)
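As an illustration of equation (1), a minimal PyTorch-style sketch of the coarse-to-fine refinement is given below; the residual maps r_s and the coarsest disparity d_S are assumed to come from the residual pyramid network, which is not shown here.

```python
import torch.nn.functional as F

def coarse_to_fine_disparity(d_S, residuals):
    """Sketch of Eq. (1): d_s = u(d_{s+1}) + r_s, applied from scale S-1 down to 0.

    d_S:       coarsest disparity map, shape (B, 1, H/2^S, W/2^S)
    residuals: list [r_{S-1}, ..., r_0] of residual maps at successively finer scales
    """
    d = d_S
    for r_s in residuals:
        # u(.): upsampling by a factor of 2 (disparity magnitudes may also need
        # rescaling by 2 when expressed in pixels; omitted here for clarity)
        d = F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=True)
        d = d + r_s                      # refine with the residual at this scale
    return d                             # full-size disparity map d_0
```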
Finally, we use the edge map as regularization in an edge-aware smoothness loss, which provides a valid guide for disparity estimation. The disparity smoothness loss C_ds encourages the disparity to be locally smooth, penalizing disparity variations in non-edge regions. To account for depth discontinuities at object contours, previous methods weight the regularization term according to the image gradients. Instead, we weight the term according to the gradient of the edge probability map, which is semantically more meaningful than intensity variations. As shown in equation (2), N denotes the number of pixels, ∂d denotes the disparity gradient and ∂E denotes the gradient of the edge probability map:

C_ds = (1/N) Σ_i ( |∂_x d_i| e^{-|∂_x E_i|} + |∂_y d_i| e^{-|∂_y E_i|} )    (2)
In the second phase, we supervise the regressed disparity at the S scales on the stereo dataset. With deep supervision, the total loss is the sum of the losses over all scales, C = Σ_s C_s, where C_s denotes the loss at scale s. In addition to the disparity smoothness loss, we use a disparity regression loss C_r for supervised learning, as shown in equation (3):

C_r = (1/N) Σ_i smooth_L1( d_i − d̂_i )    (3)

where d̂ denotes the ground-truth disparity map. Thus, the total loss at scale s becomes

C_s = C_r + λ_ds · C_ds,

where λ_ds is the loss weight of the smoothness term. In addition, the weights of the edge sub-network are fixed.
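A minimal PyTorch-style sketch of these losses follows; the smooth-L1 form of the regression loss and the value of λ_ds are assumptions for illustration, and disp, disp_gt and edge_prob are assumed to be (B, 1, H, W) tensors.

```python
import torch
import torch.nn.functional as F

def edge_aware_smoothness_loss(disp, edge_prob):
    """Sketch of Eq. (2): penalize disparity gradients, down-weighted where the
    edge probability map itself has strong gradients (likely depth boundaries)."""
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_e = (edge_prob[..., :, 1:] - edge_prob[..., :, :-1]).abs()
    dy_e = (edge_prob[..., 1:, :] - edge_prob[..., :-1, :]).abs()
    return (dx_d * torch.exp(-dx_e)).mean() + (dy_d * torch.exp(-dy_e)).mean()

def total_loss_at_scale(disp, disp_gt, edge_prob, lambda_ds=0.1):
    """Sketch of Eq. (3) plus the per-scale total loss C_s = C_r + lambda_ds * C_ds."""
    regression = F.smooth_l1_loss(disp, disp_gt)          # Eq. (3), smooth-L1 assumed
    return regression + lambda_ds * edge_aware_smoothness_loss(disp, edge_prob)
```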
We use two independent feature extractors, called the feature encoder and the RINDNet-based encoder. The feature encoder is applied to the left and right images and maps each image to a dense feature map, which is then used to construct the correlation volume.
The network consists of a series of residual blocks and downsampling layers and generates a 256-channel feature map at 1/4 or 1/8 of the input image resolution, depending on the number of downsampling layers used in our experiments. We use instance normalization in the feature encoder. The RINDNet-based encoder, by contrast, is an end-to-end network. First stage: common features of all edges are extracted. We first extract features common to all edge types using the backbone, since these edges have similar patterns in the intensity variations of the image. The backbone follows the structure of ResNet-50, which consists of five repeated building blocks. Specifically, the feature maps of the five blocks of ResNet-50 are denoted res_1, res_2, res_3, res_4 and res_5. Spatial cues are then generated from these features.
It is well known that different layers of CNN features encode different levels of appearance/semantic information and contribute differently to different edge types. Specifically, the lower-level feature maps res_{1-3} focus more on low-level cues (e.g., color, texture, and brightness), while the top-level maps res_{4-5} carry object-awareness information. It is therefore beneficial to capture multi-level spatial responses from the different feature map layers. Given the feature maps res_{1-5}, we obtain spatial response maps s_1, …, s_5, where each spatial response s_j is learned by a spatial layer composed of a convolutional layer and a deconvolution layer.
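By way of illustration, a minimal PyTorch-style sketch of one such spatial layer is given below; the channel counts, kernel sizes and upsampling factor are assumptions, not values specified by the invention.

```python
import torch.nn as nn

class SpatialLayer(nn.Module):
    """Sketch of a spatial layer: a convolution followed by a deconvolution that
    maps a backbone feature map res_j to a single-channel spatial response at a
    common resolution. All hyper-parameters here are illustrative."""
    def __init__(self, in_channels, up_factor=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose2d(1, 1, kernel_size=2 * up_factor,
                                         stride=up_factor, padding=up_factor // 2)

    def forward(self, res_j):
        return self.deconv(self.conv(res_j))   # spatial response map s_j
```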
Second stage: unique features are prepared for the REs/IEs and NEs/DEs.
RINDNet then learns the specific features of each edge type separately through the corresponding decoder in stage II.
We designed a decoder with two streams to recover fine position information, as shown in fig. 5. In the proposed architecture, the two stream decoders can work together and learn more powerful features from different perspectives.
Although the four decoders share the same structure, some special designs are proposed for the different edge types, described in detail below. To properly distinguish each edge type and better describe our work, we group the four edge types into two groups, namely REs/IEs and NEs/DEs, and prepare features for each group separately.
REs and IEs. In practice, the low-level features (e.g., res_{1-3}) capture the detailed intensity changes that are often reflected in REs and IEs. In addition, REs and IEs are related to the global context provided by the high-level features (e.g., res_5) and to surrounding objects. It is therefore desirable that semantic cues give proper guidance to the perceived intensity changes before they are forwarded to the decoder. Furthermore, it is worth noting that simply concatenating low-level and high-level features may be computationally too expensive due to the increased number of parameters. Therefore, we propose a Weight Layer (WL) that adaptively fuses low-level features and high-level cues in a learnable manner without increasing the feature dimensionality.
As shown in fig. 5, the WL contains two paths: the first path receives the high-level feature res_5 through a deconvolution layer to restore high resolution, and then two 3 × 3 convolutional layers with Batch Normalization (BN) and ReLU mine adaptive semantic cues; the other path is implemented as two convolutional layers with BN and ReLU, which encode the low-level features res_{1-3}. The two paths are then fused by element-wise multiplication. Formally, given the low-level features res_{1-3} and the high-level cue res_5, we generate fusion features for REs and IEs respectively:

g_r = WL_r([res_1, res_2, up(res_3)], res_5), g_i = WL_i([res_1, res_2, up(res_3)], res_5),

where the WLs of the REs and IEs are denoted WL_r and WL_i respectively, g_r / g_i are the fusion features of the REs/IEs, and [·] denotes concatenation. Note that the resolution of res_3 is lower than that of res_1 and res_2, so an upsampling operation up(·) is applied to res_3 before feature concatenation. Next, the fused features are fed into the respective decoders to generate specific features with accurate position information for the REs and IEs, respectively:
f_r = Dec_r(g_r), f_i = Dec_i(g_i),

where Dec_r and Dec_i denote the decoders of the REs and IEs respectively, and f_r / f_i are the decoded feature maps of the REs/IEs.
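A minimal PyTorch-style sketch of the weight layer is given below; the channel sizes and the deconvolution stride are assumptions chosen only so that the two paths have matching shapes.

```python
import torch
import torch.nn as nn

class WeightLayer(nn.Module):
    """Sketch of WL: one path decodes the high-level feature res_5 into semantic
    cues (deconv + two 3x3 conv/BN/ReLU), the other encodes the concatenated
    low-level features [res_1, res_2, up(res_3)] (two conv/BN/ReLU), and the two
    paths are fused by element-wise multiplication."""
    def __init__(self, low_channels, high_channels, out_channels, up_factor=8):
        super().__init__()
        self.high_path = nn.Sequential(
            nn.ConvTranspose2d(high_channels, out_channels,
                               kernel_size=2 * up_factor, stride=up_factor,
                               padding=up_factor // 2),          # restore resolution
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )
        self.low_path = nn.Sequential(
            nn.Conv2d(low_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )

    def forward(self, low_feats, res5):
        # low_feats: [res_1, res_2, up(res_3)] concatenated along the channel axis
        return self.low_path(low_feats) * self.high_path(res5)  # element-wise fusion
```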
NEs and DEs. Since the high-level features (e.g., res_5) express strong semantic responses, which are usually concentrated in NEs and DEs, we obtain the specific features of NEs and DEs from res_5:

f_n = Dec_n(res_5), f_d = Dec_d(res_5),

where the NE decoder and DE decoder are denoted Dec_n and Dec_d respectively, and f_n / f_d are the decoded features of the NEs/DEs. Since DEs and NEs typically share some relevant geometric cues, we share the weights of the second streams of the NE and DE decoders to learn the collaborative geometric cues. Meanwhile, the first streams of the NE decoder and the DE decoder are responsible for learning the specific features of NEs and DEs, respectively.
The initial predictions are then produced by decision heads:

O_r = H_r(f_r), O_i = H_i(f_i),

where O_r / O_i are the initial predictions of the REs/IEs, and the decision heads of the REs and IEs, denoted H_r and H_i respectively, are modeled as a 3 × 3 convolutional layer followed by a 1 × 1 convolutional layer. Note that the REs and IEs do not directly rely on the location cues provided by the top level, so the spatial cues are not used for them. In contrast, all of the spatial cues s_{1-5} are concatenated with the decoded features to generate the initial results for the NEs and DEs:

O_n = H_n([f_n, s_1, …, s_5]), O_d = H_d([f_d, s_1, …, s_5]),

where H_n and H_d, the decision heads of the NEs and DEs respectively, consist of three 1 × 1 convolutional layers that integrate the cues at each location. In short, O = {O_r, O_i, O_n, O_d} denotes the initial result set.
Attention module. Finally, RINDNet integrates the initial results with the attention maps obtained by the Attention Module (AM) to generate the final results. Since different types of edges appear at different locations, it is necessary to pay more attention to the relevant locations when predicting each edge type. Fortunately, the edge annotation provides a label for each location. Thus, the proposed AM can infer the spatial relationships between the multiple labels under pixel-level supervision through an attention mechanism, and the attention maps can be used to activate the responses of the relevant locations. Formally, given the input image X, the AM learns spatial attention

A = {A_b, A_r, A_i, A_n, A_d} = soft(ψ_att(X)),    (10)

where A is the attention normalized by the softmax function, and A_b, A_r, A_i, A_n, A_d ∈ [0,1]^{W×H} are the attention maps corresponding to the background, REs, IEs, NEs and DEs, respectively. Obviously, if a pixel carries a label of a given type, the position of that pixel should be assigned a higher attention value. The AM ψ_att is implemented by the first building block of ResNet, four 3 × 3 convolutional layers (each followed by ReLU and BN operations), and one 1 × 1 convolutional layer, as shown in fig. 5. Finally, the initial results are integrated with the attention maps to generate the final result y:

y = sigmoid(O ⊙ A_{r,i,n,d}),    (11)

where ⊙ denotes element-wise multiplication.
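The attention-based fusion of equations (10) and (11) can be sketched as follows; the tensor layouts (channel ordering of the attention maps and initial results) are assumptions for illustration.

```python
import torch

def fuse_with_attention(initial_results, attention_logits):
    """Sketch of Eqs. (10)-(11): softmax over the per-type attention maps, drop the
    background map, gate the initial predictions element-wise, apply a sigmoid.

    initial_results:  (B, 4, H, W) stacking O_r, O_i, O_n, O_d
    attention_logits: (B, 5, H, W) raw outputs of psi_att for background/REs/IEs/NEs/DEs
    """
    A = torch.softmax(attention_logits, dim=1)       # Eq. (10): normalized attention
    A_edges = A[:, 1:]                               # keep A_r, A_i, A_n, A_d
    return torch.sigmoid(initial_results * A_edges)  # Eq. (11): y = sigmoid(O * A)
```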
The unique features that optimize the detection of the four edge types are learned as described above. RINDNet performs initial-result inference in three stages: extracting common features, preparing distinguishing features, and generating the initial results, followed by the final predictions integrated by the attention module.
Correlation volume, correlation pyramid, and correlation lookup.
Correlation volume: we use the dot product between feature vectors as a measure of visual similarity. Whereas the 4D correlation volume of RAFT is built by computing visual similarities between all pairs of pixels, we limit the computation of the correlation volume to pixels sharing the same y-coordinate. Given feature maps f, g ∈ R^{H×W×D} extracted from I_L and I_R respectively, the 3D correlation volume can be computed with a modification of the 4D volume construction by restricting the inner products to feature vectors that share the same first index:

C_{ijk} = Σ_h f_{ijh} · g_{ikh},  C ∈ R^{H×W×W}.
as with 4D volumes, the computation of 3D volumes can be efficiently achieved using a single matrix multiplication, which can be easily computed on the GPU and takes only a small fraction of the total run time.
In rectified stereo, we can generally assume that all disparities are positive; thus, in practice the correlation volume would only need to be computed for positive disparities. However, an advantage of computing the entire volume is that the operation can be implemented with highly optimized matrix multiplication. This simplifies the overall architecture, allowing us to use general-purpose operations without a custom GPU kernel.
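A minimal PyTorch-style sketch of this 3D correlation volume, computed as a single batched matrix multiplication via einsum, is shown below; the (H, W, D) feature layout is an assumption.

```python
import torch

def correlation_volume_3d(f_left, f_right):
    """Sketch: C[b, i, j, k] = sum_h f_left[b, i, j, h] * f_right[b, i, k, h].
    Inner products are restricted to feature vectors on the same image row,
    giving a volume of shape (B, H, W, W)."""
    # f_left, f_right: (B, H, W, D) feature maps from I_L and I_R
    return torch.einsum("bijh,bikh->bijk", f_left, f_right)
```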
Correlation pyramid: we construct a 4-level pyramid of correlation volumes by repeatedly average-pooling the last dimension. The (k+1)-th level of the pyramid is built from the volume at level k using a 1D average pool with kernel size 2 and stride 2, producing a new volume C_{k+1} of dimension H × W × W/2^k. Each level of the pyramid has an increased receptive field, but by pooling only the last dimension we preserve the high-resolution information present in the original image, which allows us to recover very fine structures.
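A sketch of the pyramid construction, under the same layout assumption as above, might look like this:

```python
import torch.nn.functional as F

def correlation_pyramid(corr, num_levels=4):
    """Sketch: repeatedly 1D-average-pool the last (disparity) dimension of the
    correlation volume with kernel size 2 and stride 2."""
    B, H, W1, W2 = corr.shape
    pyramid = [corr]
    vol = corr.reshape(B * H * W1, 1, W2)            # treat last dim as a 1D signal
    for _ in range(num_levels - 1):
        vol = F.avg_pool1d(vol, kernel_size=2, stride=2)
        pyramid.append(vol.reshape(B, H, W1, -1))
    return pyramid
```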
Correlation lookup: to index into the correlation pyramid, we define a lookup operator L_C similar to the one defined in RAFT. Given a current estimate of the disparity d, we construct a one-dimensional grid with integer offsets around the current disparity estimate, as shown in fig. 3. The grid is used to index into each level of the correlation pyramid. Since the grid values are real numbers, we use linear interpolation when indexing each volume. The retrieved values are then concatenated into a single feature map.
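A sketch of such a lookup operator is given below; the sign convention (the matching column being j − d) and the lookup radius are assumptions for illustration.

```python
import torch

def lookup_correlation(pyramid, disparity, radius=4):
    """Sketch of the lookup operator L_C: sample every pyramid level on a 1D grid of
    integer offsets around the current disparity estimate, with linear interpolation,
    and concatenate the samples into a single feature map.

    pyramid:   list of volumes of shape (B, H, W, W / 2^l)
    disparity: current estimate, shape (B, H, W)
    """
    B, H, W = disparity.shape
    offsets = torch.arange(-radius, radius + 1, device=disparity.device).float()
    grid_j = torch.arange(W, device=disparity.device).float().view(1, 1, W, 1)
    samples = []
    for level, corr in enumerate(pyramid):
        W_l = corr.shape[-1]
        # column of the candidate match in the right image at this pyramid level
        coords = (grid_j - disparity.unsqueeze(-1)) / (2 ** level) + offsets
        coords = coords.clamp(0, W_l - 1)
        x0 = coords.floor()
        x1 = (x0 + 1).clamp(max=W_l - 1)
        w = coords - x0                                        # interpolation weight
        c0 = torch.gather(corr, -1, x0.long())
        c1 = torch.gather(corr, -1, x1.long())
        samples.append(c0 * (1 - w) + c1 * w)                  # linear interpolation
    return torch.cat(samples, dim=-1)   # (B, H, W, num_levels * (2*radius + 1))
```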
The multi-level update operator.
Starting from an initial disparity field d_0 = 0, we predict a series of disparity fields {d_1, …, d_N}. In each iteration, we index the correlation volume using the current estimate of disparity, producing a set of correlation features. These features are passed through 2 convolutional layers. Similarly, the current disparity estimate also passes through 2 convolutional layers. The correlation, disparity, and RINDNet-extracted features are then concatenated and injected into the GRU. The GRU updates the hidden state, and the new hidden state is then used to predict the disparity update.
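A minimal sketch of this update loop is shown below; the module names (corr_encoder, disp_encoder, gru, disp_head) and the hidden-state initialization are assumptions, and lookup_correlation refers to the lookup sketch above.

```python
import torch

def run_update_operator(corr_pyramid, context_feats, gru, corr_encoder,
                        disp_encoder, disp_head, num_iters=12):
    """Sketch of the iterative GRU update: start from d_0 = 0, look up correlation
    features at the current disparity, encode correlation and disparity with two
    conv layers each, concatenate with the RINDNet context features, update the
    GRU hidden state, and predict a disparity update."""
    B, C, H, W = context_feats.shape
    disp = torch.zeros(B, 1, H, W, device=context_feats.device)   # d_0 = 0
    hidden = torch.tanh(context_feats)        # hidden-state initialization (assumption)
    predictions = []
    for _ in range(num_iters):
        corr = lookup_correlation(corr_pyramid, disp.squeeze(1))   # (B, H, W, C_corr)
        x = torch.cat([corr_encoder(corr.permute(0, 3, 1, 2)),     # correlation features
                       disp_encoder(disp),                         # disparity features
                       context_feats], dim=1)                      # RINDNet features
        hidden = gru(hidden, x)               # GRU updates the hidden state
        disp = disp + disp_head(hidden)       # new hidden state predicts the update
        predictions.append(disp)
    return predictions
```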
Multiple hidden states: the original RAFT performs updates entirely at a single, fixed high resolution. One problem with this approach is that the receptive field grows very slowly as the number of GRU updates increases, which is problematic for scenes with large textureless regions and little local information. We address this by proposing a multi-resolution update operator that operates simultaneously on feature maps at 1/4, 1/8 and 1/16 resolution. In our experiments, we show that the multi-resolution update operator yields better generalization performance.
The GRUs are cross-connected by using each other's hidden states as inputs, as shown in fig. 4. The correlation lookup and the final disparity update are performed by the GRU at the highest resolution. We also experimented with higher-resolution models, with GRU updates at 1/4, 1/8 and 1/16 of the input image resolution.
Upsampling: the predicted disparity field is at 1/4 or 1/8 of the input image resolution. To output a full-resolution disparity map, we use the same convex upsampling method as RAFT.
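For reference, convex upsampling (as introduced by RAFT) can be sketched as follows; the mask is the per-pixel weight tensor predicted by the network, and the factor-of-4 setting and the scaling of disparity values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def convex_upsample(disp, mask, factor=4):
    """Sketch of convex upsampling: each full-resolution disparity value is a learned
    convex combination of the 3x3 coarse neighbourhood around it.

    disp: (B, 1, H, W) coarse disparity
    mask: (B, 9 * factor * factor, H, W) weights predicted by the network
    """
    B, _, H, W = disp.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)                        # convex weights over 3x3 window
    up = F.unfold(factor * disp, kernel_size=3, padding=1)   # scale values with resolution
    up = up.view(B, 1, 9, 1, 1, H, W)
    up = torch.sum(mask * up, dim=2)                         # convex combination
    up = up.permute(0, 1, 4, 2, 5, 3)                        # (B, 1, H, factor, W, factor)
    return up.reshape(B, 1, factor * H, factor * W)
```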
The above embodiments are merely illustrative of a preferred embodiment, but not limiting. When the invention is implemented, appropriate replacement and/or modification can be carried out according to the requirements of users.
While embodiments of the invention have been disclosed above, it is not intended to be limited to the uses set forth in the specification and examples. It can be applied to all kinds of fields suitable for the present invention. Additional modifications will readily occur to those skilled in the art. It is therefore intended that the invention not be limited to the exact details and illustrations described and illustrated herein, but fall within the scope of the appended claims and equivalents thereof.

Claims (7)

1. An edge detection-based binocular vision stereo matching method is characterized by comprising the following steps:
given a pair of rectified images (I_L, I_R), the goal is to estimate the disparity field d, which gives the horizontal displacement of each pixel in I_L; similar to RAFT-Stereo, the method consists of three components: an edge feature extractor, a correlation pyramid, and a GRU-based update operator that iteratively retrieves features from the correlation pyramid and performs updates on the disparity field.
2. The binocular vision stereo matching method based on edge detection according to claim 1, wherein: extracting edge features:
encoding multi-task semantic information through a semantic pyramid, performing feature extraction following EdgeStereo, and guiding accurate disparity estimation with boundary cues; in the multi-task model, the semantic pyramid encodes context cues and the full-size disparity is inferred back using an hourglass structure;
EdgeStereo is a distinct multi-task structure used to perform multi-stage training;
using two independent feature extractors, namely a feature encoder and a RINDNet-based encoder; the feature encoder is applied to the left and right images and maps each image to a dense feature map, which is then used to construct the correlation volume; instance normalization is used in the feature encoder.
3. The binocular vision stereo matching method based on edge detection according to claim 2, wherein: application of the RINDNet-based encoder:
RINDNet is an end-to-end network;
first stage: extracting common features of all edges; the common features of all edges are extracted using a backbone, since the edges have similar patterns in the intensity variations of the image; the backbone follows the ResNet-50 structure, which consists of five repeated building blocks;
second stage: preparing unique features for the REs/IEs and NEs/DEs;
RINDNet learns the specific features of each edge type separately through the corresponding decoder in stage II; a decoder with two streams recovers fine position information, and the four edge types are divided into two groups, namely REs/IEs and NEs/DEs, for which features are prepared separately;
the low-level features capture the detailed intensity changes reflected in REs and IEs; REs and IEs are related to the global context provided by the high-level features and to surrounding objects; semantic cues are expected to give guidance on the perceived intensity changes before being forwarded to the decoder, and the low-level features and high-level cues are adaptively fused in a learnable manner by the weight layer WL.
4. The binocular vision stereo matching method based on edge detection according to claim 3, wherein: application of the weight layer WL:
the WL contains two paths: the first path receives the high-level feature res_5 through a deconvolution layer to restore high resolution, and then two 3 × 3 convolutional layers with Batch Normalization (BN) and ReLU mine adaptive semantic cues; the other path is implemented as two convolutional layers with BN and ReLU, which encode the low-level features res_{1-3}; the two paths are then fused by element-wise multiplication; formally, given the low-level features res_{1-3} and the high-level cue res_5, fusion features are generated for the REs and the IEs, respectively;
the resolution of res_3 is lower than that of res_1 and res_2, so an upsampling operation up(·) is applied to res_3 before feature concatenation; the fused features are fed into the respective decoders, generating specific features with accurate position information for the REs and IEs, respectively.
5. The binocular vision stereo matching method based on edge detection according to claim 4, wherein: the specific features of NEs and DEs are obtained using res_5;
DEs and NEs share geometric cues, and the weights of the second streams of the NE decoder and the DE decoder are shared to learn collaborative geometric cues, while the first streams of the NE decoder and the DE decoder are responsible for learning the specific features of NEs and DEs, respectively;
all of the spatial cues are concatenated with the decoded features to generate the initial results for NEs and DEs, respectively;
RINDNet integrates the initial results with the attention maps obtained by the attention module AM to generate the final results; the edge annotations provide a label for each location; the attention module AM can infer the spatial relationships between the multiple labels under pixel-level supervision through an attention mechanism, and the attention maps are used to activate the responses of the relevant locations; formally, given an input image X, the attention module AM learns spatial attention;
RINDNet performs initial-result inference in three stages: extracting common features, preparing distinguishing features, and generating the initial results, followed by the final predictions integrated by the attention module.
6. The binocular vision stereo matching method based on edge detection according to claim 1, wherein: correlation volume, correlation pyramid, and correlation lookup:
correlation volume: using the dot product between feature vectors as a measure of visual similarity;
unlike a 4D correlation volume constructed by computing the visual similarity between all pairs of pixels, the computation of the correlation volume is limited to pixels sharing the same y-coordinate; given feature maps f, g ∈ R^{H×W×D} extracted from I_L and I_R respectively, the 3D correlation volume is computed with a modification of the 4D volume construction by limiting the inner products to feature vectors sharing the same first index;
the computation of the 3D volume is efficiently implemented using a single matrix multiplication;
in rectified stereo, all disparities are assumed to be positive, and the correlation volume is computed for positive disparities;
correlation pyramid: a 4-level correlation pyramid is constructed by repeatedly average-pooling the last dimension; the (k+1)-th level of the pyramid is built from the volume at level k using a 1D average pool with kernel size 2 and stride 2, producing a new volume C_{k+1} of dimension H × W × W/2^k; each level of the pyramid has an increased receptive field, and the high-resolution information present in the original image is preserved by pooling only the last dimension;
correlation lookup: a lookup operator L_C is defined; given a current estimate of the disparity d, a one-dimensional grid with integer offsets is constructed around the current disparity estimate; linear interpolation is used when indexing each volume; the retrieved values are then concatenated into a single feature map.
7. The binocular vision stereo matching method based on edge detection according to claim 1, wherein: a multi-level update operator;
starting from an initial disparity field d_0 = 0, a series of disparity fields {d_1, ..., d_N} is predicted; in each iteration, the correlation volume is indexed using the current estimate of the disparity, generating a set of correlation features; these features pass through 2 convolutional layers; the current disparity estimate also passes through 2 convolutional layers; the correlation, disparity, and RINDNet-extracted features are then concatenated and injected into the GRU; the GRU updates the hidden state; the new hidden state is then used to predict the disparity update;
multiple hidden states: the original RAFT performs updates entirely at a fixed high resolution; a multi-resolution update operator runs simultaneously on feature maps at 1/4, 1/8 and 1/16 resolution;
the GRUs are cross-connected, using each other's hidden states as inputs; experiments are also performed with higher-resolution models, with GRU updates at 1/4, 1/8 and 1/16 of the input image resolution;
upsampling: the predicted disparity field is at 1/4 or 1/8 of the input image resolution; the full-resolution disparity map is output using the same convex upsampling method as RAFT.
CN202210670652.0A 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection Pending CN115049739A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210670652.0A CN115049739A (en) 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210670652.0A CN115049739A (en) 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection

Publications (1)

Publication Number Publication Date
CN115049739A true CN115049739A (en) 2022-09-13

Family

ID=83162125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210670652.0A Pending CN115049739A (en) 2022-06-14 2022-06-14 Binocular vision stereo matching method based on edge detection

Country Status (1)

Country Link
CN (1) CN115049739A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487741A (en) * 2021-06-01 2021-10-08 中国科学院自动化研究所 Dense three-dimensional map updating method and device
CN113487741B (en) * 2021-06-01 2024-05-28 中国科学院自动化研究所 Dense three-dimensional map updating method and device
CN116128946A (en) * 2022-12-09 2023-05-16 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism
CN116128946B (en) * 2022-12-09 2024-02-09 东南大学 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Similar Documents

Publication Publication Date Title
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Cherabier et al. Learning priors for semantic 3d reconstruction
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
CN112926396A (en) Action identification method based on double-current convolution attention
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN111179187B (en) Single image rain removing method based on cyclic generation countermeasure network
CN110197505A (en) Remote sensing images binocular solid matching process based on depth network and semantic information
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112651360B (en) Skeleton action recognition method under small sample
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN115984494A (en) Deep learning-based three-dimensional terrain reconstruction method for lunar navigation image
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN112288772B (en) Channel attention target tracking method based on online multi-feature selection
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN113807356A (en) End-to-end low visibility image semantic segmentation method
Xu et al. AutoSegNet: An automated neural network for image segmentation
Chen et al. PDWN: Pyramid deformable warping network for video interpolation
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN115187638A (en) Unsupervised monocular depth estimation method based on optical flow mask
Zhou et al. Attention transfer network for nature image matting
Du et al. Srh-net: Stacked recurrent hourglass network for stereo matching
CN113888399B (en) Face age synthesis method based on style fusion and domain selection structure
Huang et al. ES-Net: An efficient stereo matching network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination