CN116128946A - Binocular infrared depth estimation method based on edge guiding and attention mechanism - Google Patents

Binocular infrared depth estimation method based on edge guiding and attention mechanism Download PDF

Info

Publication number
CN116128946A
Authority
CN
China
Prior art keywords
depth
map
characteristic
edge
characteristic diagram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211588573.1A
Other languages
Chinese (zh)
Other versions
CN116128946B (en)
Inventor
耿可可
王金虎
殷国栋
汤文成
成小龙
孙宇啸
丁鹏博
王子威
柳志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202211588573.1A priority Critical patent/CN116128946B/en
Publication of CN116128946A publication Critical patent/CN116128946A/en
Application granted granted Critical
Publication of CN116128946B publication Critical patent/CN116128946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/90 Dynamic range modification of images or parts thereof
    • G06T 5/92 Dynamic range modification of images or parts thereof based on global image properties
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/10048 Infrared image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20024 Filtering details
    • G06T 2207/20032 Median filtering
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20172 Image enhancement details
    • G06T 2207/20192 Edge enhancement; Edge preservation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a binocular infrared depth estimation method based on edge guiding and an attention mechanism. It relates to the technical field of binocular vision in computer vision and solves the technical problem of low depth estimation accuracy caused by the inherent defects of infrared images, such as unclear texture, blurred edges, and weak features. A mixed attention module is constructed on the high-dimensional feature maps to capture the deep correlations between different channels and spatial positions of the features to be matched, promoting effective depth reasoning in the subsequent network. Meanwhile, an edge guiding module is introduced and an edge-depth joint loss function is constructed to generate a foreground depth map with sharp edges, smooth depth, and no depth holes, making it possible for an intelligent agent to keep operating normally in a low-illumination environment.

Description

Binocular infrared depth estimation method based on edge guiding and attention mechanism
Technical Field
The application relates to the technical field of binocular vision in computer vision, in particular to a binocular infrared depth estimation method based on edge guiding and attention mechanisms.
Background
As a research hotspot in the field of computer vision, binocular depth estimation has been widely applied in three-dimensional reconstruction, autonomous driving, mobile robotics, and related fields. For a pair of rectified stereo images captured by a binocular camera, the essence of depth estimation is to find, for each pixel, the matching point at which the matching cost is minimal, and to output the disparity between the left and right matching points. Compared with traditional depth estimation algorithms, deep-learning-based algorithms can effectively mitigate the ill-posed nature of image depth estimation and can use prior knowledge to learn and estimate the depth of occluded and weakly textured regions. However, most existing research is based only on visible-light images. Unlike a visible-light camera, which is limited by low illumination and harsh environments, an infrared camera can still image in a low-illumination environment by receiving the infrared radiation emitted by objects in the scene, thereby perceiving the surroundings. On the one hand, it is therefore worthwhile to exploit this characteristic of infrared cameras and carry out depth estimation research based on infrared images; on the other hand, research on infrared images involves certain difficulties because of inherent defects such as unclear texture, blurred edges, and weak features.
Disclosure of Invention
The application provides a binocular infrared depth estimation method based on edge guiding and an attention mechanism, which aims to compensate for the inherent defects of infrared images, such as unclear texture, blurred edges, and weak features, so that an intelligent agent can keep operating normally in a low-illumination environment.
The technical aim of the application is achieved through the following technical scheme:
A binocular infrared depth estimation method based on edge guiding and attention mechanisms comprises:
S1: constructing an edge-guided depth estimation network framework;
S2: training the depth estimation network framework to obtain a first depth estimation network;
S3: validating the first depth estimation network, validation being complete when the accuracy of the first depth estimation network meets a preset threshold, otherwise repeating step S2;
S4: performing binocular infrared depth estimation with the first depth estimation network;
wherein the depth estimation network framework comprises an image preprocessing module, a feature extraction module, a pyramid pooling module, a mixed attention mechanism module, a stacked hourglass module, and an edge guiding module.
The beneficial effects of the application are as follows: in the binocular infrared depth estimation method based on edge guiding and an attention mechanism, an image preprocessing module based on gamma correction and median filtering is introduced to enhance image edge and detail information, providing the convolutional neural network with richer deep feature representations to mine; a mixed attention module is constructed on the high-dimensional feature maps to capture the deep correlations between different channels and spatial positions of the features to be matched, promoting effective depth reasoning in the subsequent network; meanwhile, an edge guiding module is introduced and an edge-depth joint loss function is constructed to generate a foreground depth map with sharp edges, smooth depth, and no depth holes, making it possible for an intelligent agent to keep operating normally in a low-illumination environment.
Drawings
FIG. 1 is a flow chart of the method described herein;
FIG. 2 is a schematic diagram of the mixed attention mechanism module;
FIG. 3 is a schematic diagram of the edge guiding module.
Detailed Description
The technical scheme of the application will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the binocular infrared depth estimation method based on the edge guiding and attention mechanism described in the application comprises the following steps:
s1: an edge-guided depth estimation network framework is constructed.
Specifically, the depth estimation network framework comprises an image preprocessing module, a feature extraction module, a pyramid pooling module, a mixed attention mechanism module, a stacked hourglass module, and an edge guiding module.
The preprocessing of the image preprocessing module comprises the following steps:
(1) Preprocessing operations based on gamma correction and median filtering are performed on the binocular infrared images after distortion correction and Bouguet epipolar rectification, yielding preprocessed images IML and IMR respectively. That is, after distortion correction and Bouguet epipolar rectification, the left and right images of the binocular infrared pair are each subjected to gamma correction and median filtering.
Gamma correction edits the gamma curve of the image to apply a nonlinear tone adjustment: it detects the dark and light portions of the image signal and increases their proportion, thereby improving the contrast of the image. Gamma correction is expressed as V_out = V_in^γ: when γ > 1, the contrast of high-gray regions of the image is enhanced; when γ < 1, the contrast of low-gray regions is enhanced; when γ = 1, the original image is unchanged.
The basic principle of median filtering is to replace the value of a point in a digital image or digital sequence with the median of the values in a neighborhood of that point, so that the surrounding pixel values are closer to the true value, thereby eliminating isolated noise points.
(2) IML and IMR are input to the feature extraction module, and IML is input to the edge guiding module. A sketch of this preprocessing step is given below.
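As an illustration only, the following minimal Python sketch shows one way to implement the gamma correction and median filtering described above. The gamma value, the 3×3 median window, the use of OpenCV, and the function name preprocess are assumptions; the patent only specifies that gamma correction and median filtering are applied to the rectified images.

    import cv2
    import numpy as np

    def preprocess(ir_img_u8, gamma=0.8, median_ksize=3):
        """Gamma correction V_out = V_in ** gamma followed by median filtering."""
        v_in = ir_img_u8.astype(np.float32) / 255.0     # normalize to [0, 1]
        v_out = np.power(v_in, gamma)                   # gamma < 1 boosts low-gray contrast
        corrected = (v_out * 255.0).astype(np.uint8)
        return cv2.medianBlur(corrected, median_ksize)  # suppress isolated noise points

    # Usage: IML = preprocess(left_rectified); IMR = preprocess(right_rectified)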
The workflow of the feature extraction module comprises:
(1) A 3×3 convolution with stride 2 is applied to IML and IMR for downsampling, followed by batch normalization and ReLU activation, yielding feature maps FL1 and FR1;
(2) FL1 and FR1 are each passed through 3 consecutive residual blocks, with batch normalization and ReLU activation, yielding feature maps FL2 and FR2;
(3) FL2 and FR2 are each passed through 16 consecutive residual blocks, with batch normalization and ReLU activation, yielding feature maps FL3 and FR3;
(4) FL3 and FR3 are each passed through 3 consecutive residual blocks performing dilated convolution with a dilation factor of 2, with batch normalization and ReLU activation, yielding feature maps FL4 and FR4;
(5) FL4 and FR4 are each passed through 3 residual blocks performing dilated convolution with a dilation factor of 4, with batch normalization and ReLU activation, yielding feature maps FL5 and FR5;
(6) FL5 and FR5 are input to the pyramid pooling module. A sketch of this feature extraction backbone is given below.
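The following PyTorch sketch outlines a shared-weight backbone with the block counts, stride, and dilation factors listed above. The channel widths, the residual-block internals, and the choice of returned intermediate maps are assumptions not fixed by the patent.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels, dilation=1):
            super().__init__()
            pad = dilation
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + x)  # residual connection

    class FeatureExtractor(nn.Module):
        def __init__(self, in_ch=1, ch=32):
            super().__init__()
            # (1) 3x3 stride-2 convolution + BN + ReLU for downsampling
            self.stem = nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
            # (2) 3 residual blocks, (3) 16 residual blocks
            self.stage2 = nn.Sequential(*[ResidualBlock(ch) for _ in range(3)])
            self.stage3 = nn.Sequential(*[ResidualBlock(ch) for _ in range(16)])
            # (4) 3 blocks with dilation 2, (5) 3 blocks with dilation 4
            self.stage4 = nn.Sequential(*[ResidualBlock(ch, dilation=2) for _ in range(3)])
            self.stage5 = nn.Sequential(*[ResidualBlock(ch, dilation=4) for _ in range(3)])

        def forward(self, x):
            f1 = self.stem(x)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            f4 = self.stage4(f3)
            f5 = self.stage5(f4)
            return f3, f5  # FL3/FR3 and FL5/FR5 are reused by the pyramid pooling module

The same module is applied to IML and IMR with shared weights, as is usual for stereo matching networks.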
The workflow of the pyramid pooling module comprises:
(1) Adaptive average pooling operations with output sizes of 64×64, 32×32, 16×16, and 8×8 are applied to FL5 and FR5 respectively, producing four feature maps of different resolutions for each input; the four feature maps are each reduced in dimensionality by a 1×1 convolution kernel and upsampled by bilinear interpolation, producing four feature maps of the same resolution;
(2) The four feature maps corresponding to FL5 are concatenated with FL3 and FL5 to obtain the feature map FL6;
(3) The four feature maps corresponding to FR5 are concatenated with FR3 and FR5 to obtain the feature map FR6;
(4) FL6 and FR6 are input to the mixed attention mechanism module. A sketch of this pyramid pooling step is given below.
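The following sketch shows a pyramid pooling step with the four pooling sizes given above. The channel widths of the 1×1 convolutions and the class name PyramidPooling are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPooling(nn.Module):
        def __init__(self, in_ch=32, branch_ch=32, pool_sizes=(64, 32, 16, 8)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Sequential(nn.AdaptiveAvgPool2d(s),            # pool to s x s
                              nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                              nn.BatchNorm2d(branch_ch),
                              nn.ReLU(inplace=True))
                for s in pool_sizes])

        def forward(self, f3, f5):
            h, w = f5.shape[-2:]
            outs = [F.interpolate(b(f5), size=(h, w), mode="bilinear", align_corners=False)
                    for b in self.branches]            # four same-resolution maps
            return torch.cat(outs + [f3, f5], dim=1)   # concatenate with FL3/FL5 -> FL6 (or FR6)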
As shown in FIG. 2, the mixed attention mechanism module comprises a channel attention mechanism and a spatial attention mechanism, and its workflow comprises:
(1) Global max pooling and global average pooling over the spatial dimensions are applied to FL6 and FR6, giving two 1×1×C channel descriptors; both descriptors are fed through a two-layer neural network with ReLU activation, the two resulting C-dimensional features are added, and a Sigmoid activation yields the weight coefficient Ac; Ac is multiplied with FL6 and FR6 respectively to obtain intermediate features FL7 and FR7;
(2) Max pooling and average pooling along the channel dimension are applied to FL7 and FR7, giving two H×W×1 descriptors; these are concatenated along the channel dimension and passed through a 7×7 convolution layer with Sigmoid activation to obtain the weight coefficient As; As is multiplied with FL7 and FR7 respectively to obtain feature maps FL8 and FR8. A sketch of this attention block is given below.
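The following sketch shows a channel-plus-spatial attention block of the kind described above. The reduction ratio inside the two-layer network and the class name MixedAttention are assumptions.

    import torch
    import torch.nn as nn

    class MixedAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(                      # two-layer network with ReLU
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels))
            self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

        def forward(self, x):
            b, c, h, w = x.shape
            # channel attention: spatial max/avg pooling -> shared two-layer net -> add -> sigmoid
            avg = self.mlp(x.mean(dim=(2, 3)))
            mx = self.mlp(x.amax(dim=(2, 3)))
            ac = torch.sigmoid(avg + mx).view(b, c, 1, 1)
            x7 = x * ac                                    # intermediate feature FL7 / FR7
            # spatial attention: channel max/avg pooling -> concat -> 7x7 conv -> sigmoid
            desc = torch.cat([x7.amax(dim=1, keepdim=True),
                              x7.mean(dim=1, keepdim=True)], dim=1)
            a_s = torch.sigmoid(self.spatial(desc))
            return x7 * a_s                                # feature map FL8 / FR8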
After the mixed attention mechanism module, the feature maps FL8 and FR8 are concatenated along the channel dimension for each disparity level to obtain a four-dimensional cost volume C_disp(u, v, d, :). Bilinear interpolation and disparity-to-depth conversion are then performed on C_disp(u, v, d, :) to obtain a depth cost volume C_depth(u, v, z, :). Finally, the depth cost volume C_depth(u, v, z, :) is input to the stacked hourglass module.
The disparity-to-depth conversion is expressed as:
Z(u, v) = f_U · B / D(u, v)    (1)
where f_U denotes the horizontal focal length; B denotes the baseline length; and D(u, v) and Z(u, v) denote the disparity and depth of the feature map at position (u, v), respectively.
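As an illustration, the following sketch builds a concatenation cost volume over disparities and resamples it at the disparities implied by a set of candidate depths using formula (1). The number of disparity levels, the depth sampling, and the simple linear resampling along the disparity axis are assumptions.

    import torch

    def build_cost_volume(fl8, fr8, max_disp=48):
        """Concatenate left/right features along the channel axis at every disparity level."""
        b, c, h, w = fl8.shape
        cost = fl8.new_zeros(b, 2 * c, max_disp, h, w)
        for d in range(max_disp):
            if d == 0:
                cost[:, :c, d] = fl8
                cost[:, c:, d] = fr8
            else:
                cost[:, :c, d, :, d:] = fl8[:, :, :, d:]
                cost[:, c:, d, :, d:] = fr8[:, :, :, :-d]
        return cost  # C_disp(u, v, d, :)

    def disparity_to_depth_volume(cost_disp, f_u, baseline, depth_bins):
        """Linearly resample the disparity axis at D = f_U * B / Z for each candidate depth Z."""
        dmax = cost_disp.shape[2]
        d = (f_u * baseline / depth_bins).clamp(0, dmax - 1)  # disparity implied by each depth bin
        lo = d.floor().long()
        hi = (lo + 1).clamp(max=dmax - 1)
        w_hi = (d - lo.float()).view(1, 1, -1, 1, 1)
        return (1 - w_hi) * cost_disp[:, :, lo] + w_hi * cost_disp[:, :, hi]  # C_depth(u, v, z, :)

    # Usage: depth_bins = torch.linspace(2.0, 50.0, steps=48)  # candidate depths in meters (assumed)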
The stacked hourglass module comprises three hourglass networks, each of which processes the depth cost volume C_depth(u, v, z, :); bilinear interpolation is applied to the three resulting outputs to obtain feature maps S_depth1, S_depth2, and S_depth3, each of size Z×H×W. When training the depth estimation network framework, S_depth1, S_depth2, and S_depth3 are all taken as initial prediction results S; when validating the depth estimation network framework, the feature map S_depth3 output by the last hourglass network is taken as the initial prediction result S. Depth regression is performed on S to obtain an initial depth map DM of size H×W, and the initial depth map DM is input to the edge guiding module.
The depth regression maps the Z×H×W volume S to a per-pixel depth value by aggregating over the depth dimension.
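The patent's own regression formula is not reproduced above, so the following sketch should be read only as one common realization of the depth regression step: a soft-argmin (probability-weighted sum) over the depth dimension.

    import torch
    import torch.nn.functional as F

    def depth_regression(s, depth_bins):
        """s: (B, Z, H, W) regularized depth cost volume; depth_bins: (Z,) candidate depths."""
        prob = F.softmax(-s, dim=1)                                 # lower cost -> higher probability
        return (prob * depth_bins.view(1, -1, 1, 1)).sum(dim=1)     # (B, H, W) initial depth map DM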
as shown in fig. 3, the workflow of the edge guiding module includes:
(1) By edge detection operators
Figure BDA0003989779000000045
Performing joint extraction on edge information of the preprocessed image IML to obtain edge density E (u, v); at the same time by an edge detection operator->
Figure BDA0003989779000000046
Extracting edge information of the initial depth map DM to obtain edge density e (u, v);
wherein ,
Figure BDA0003989779000000047
Figure BDA0003989779000000048
(2) Constructing an edge loss function by E (u, v) and E (u, v), expressed as:
Figure BDA0003989779000000049
(3) Constructing a depth loss function corresponding to the initial depth map DM, wherein the depth loss function is expressed as:
Figure BDA00039897790000000410
(4) Finally, the joint loss function L is obtained tatal Expressed as:
L total =αL edge +βL depth (7)
wherein alpha and beta represent balance coefficients of corresponding loss terms;
(5) By the joint loss function L tatal And performing joint supervision to obtain a final prediction depth map with clear edges.
As a specific embodiment, the edge detection operators comprise the Laplacian operator and the Canny operator, and the edge information of the preprocessed image or the initial depth map is jointly extracted by the Laplacian and Canny operators. The Canny operator can detect weak image edges under noisy conditions, which complements the Laplacian operator, which locates step edges accurately but is easily disturbed by noise; together they achieve an edge enhancement effect.
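As an illustration, the following sketch combines Laplacian and Canny responses into an edge density and forms a joint loss of the form L_total = α·L_edge + β·L_depth. The normalization of the operator responses, the Canny thresholds, and the L1 form of the individual loss terms are assumptions not stated in the patent.

    import cv2
    import numpy as np

    def edge_density(img_u8):
        """Joint Laplacian + Canny edge response, normalized to [0, 1]."""
        lap = np.abs(cv2.Laplacian(img_u8, cv2.CV_32F, ksize=3))
        lap = lap / (lap.max() + 1e-6)
        canny = cv2.Canny(img_u8, 50, 150).astype(np.float32) / 255.0
        return np.clip(lap + canny, 0.0, 1.0)

    def joint_loss(E_img, e_depth, depth_pred, depth_gt, valid, alpha=1.0, beta=1.0):
        """L_total = alpha * L_edge + beta * L_depth (L1 forms assumed for illustration)."""
        l_edge = np.mean(np.abs(E_img - e_depth))
        l_depth = np.mean(np.abs(depth_pred[valid] - depth_gt[valid]))
        return alpha * l_edge + beta * l_depth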
S2: The depth estimation network framework is trained to obtain a first depth estimation network.
Specifically, step S2 comprises:
S21: image_2 and image_3 from the training and validation sets of the KITTI dataset are input into a pre-trained GAN network to perform style transfer from color images to infrared images;
S22: a dataset is constructed from the binocular infrared image files and their corresponding calib.txt and velodyne.bin files;
S23: training uses the Adam optimizer, with an initial learning rate of 1e-4 that is automatically decayed during training, and β1 = 0.9, β2 = 0.999;
S24: after each iteration, the training loss and validation loss are computed; the validation losses of the iterations are compared, the model parameters with the minimum validation loss are saved, and the model corresponding to these parameters is the first depth estimation network. A sketch of this training procedure is given below.
S3: The first depth estimation network is validated; validation is complete when the accuracy of the first depth estimation network meets a preset threshold, otherwise step S2 is repeated.
Specifically, step S3 includes:
S31: The binocular infrared images in the validation set are input into the first depth estimation network to obtain a predicted depth map Z_p;
S32: The 3D coordinates of the radar point cloud are converted into pixel-plane coordinates, the depth information z is retained, and the accuracy of the depth estimation is calculated, expressed as:
z = Z(u, v);    (8)
u = f_U · x / z + c_U;    (9)
v = f_V · y / z + c_V;    (10)
where x, y, z denote the spatial coordinate components of the radar point cloud; c_U and c_V denote the coordinate components of the camera principal point; and f_U and f_V denote the horizontal and vertical focal lengths of the camera, respectively. A sketch of this projection-based evaluation is given below.
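The following sketch projects lidar points onto the image plane using the formulas above and compares the retained depth z with the predicted depth map. The helper name evaluate_depth and the mean absolute relative error used as the accuracy measure are illustrative assumptions.

    import numpy as np

    def evaluate_depth(points_xyz, depth_pred, f_u, f_v, c_u, c_v):
        """points_xyz: (N, 3) lidar points in the camera frame; depth_pred: (H, W) map Z_p."""
        x, y, z = points_xyz.T
        keep = z > 0                                     # points in front of the camera
        x, y, z = x[keep], y[keep], z[keep]
        u = np.round(f_u * x / z + c_u).astype(int)      # pixel-plane coordinate, formula (9)
        v = np.round(f_v * y / z + c_v).astype(int)      # pixel-plane coordinate, formula (10)
        h, w = depth_pred.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        z_pred = depth_pred[v[inside], u[inside]]        # Z(u, v)
        z_true = z[inside]
        return np.mean(np.abs(z_pred - z_true) / z_true) # mean absolute relative error (assumed metric)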
S4: Binocular infrared depth estimation is performed with the first depth estimation network.
The foregoing is an exemplary embodiment of the present application, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A binocular infrared depth estimation method based on edge guiding and attention mechanisms, comprising:
S1: constructing an edge-guided depth estimation network framework;
S2: training the depth estimation network framework to obtain a first depth estimation network;
S3: validating the first depth estimation network, validation being complete when the accuracy of the first depth estimation network meets a preset threshold, otherwise repeating step S2;
S4: performing binocular infrared depth estimation with the first depth estimation network;
wherein the depth estimation network framework comprises an image preprocessing module, an edge guiding module, a feature extraction module, a pyramid pooling module, a mixed attention mechanism module, and a stacked hourglass module.
2. The method of claim 1, wherein the preprocessing of the image preprocessing module comprises:
preprocessing operations based on gamma correction and median filtering are performed on the binocular infrared images after distortion correction and Bouguet epipolar rectification, yielding preprocessed images IML and IMR respectively;
IML and IMR are input to the feature extraction module, and IML is input to the edge guiding module;
wherein the gamma correction is expressed as V_out = V_in^γ: when γ > 1, the contrast of high-gray regions of the image is enhanced; when γ < 1, the contrast of low-gray regions is enhanced; when γ = 1, the original image is unchanged.
3. The method of claim 2, wherein the workflow of the feature extraction module comprises:
a 3×3 convolution with stride 2 is applied to IML and IMR respectively for downsampling, followed by batch normalization and ReLU activation, yielding feature maps FL1 and FR1;
FL1 and FR1 are each passed through 3 consecutive residual blocks, with batch normalization and ReLU activation, yielding feature maps FL2 and FR2;
FL2 and FR2 are each passed through 16 consecutive residual blocks, with batch normalization and ReLU activation, yielding feature maps FL3 and FR3;
FL3 and FR3 are each passed through 3 consecutive residual blocks performing dilated convolution with a dilation factor of 2, with batch normalization and ReLU activation, yielding feature maps FL4 and FR4;
FL4 and FR4 are each passed through 3 residual blocks performing dilated convolution with a dilation factor of 4, with batch normalization and ReLU activation, yielding feature maps FL5 and FR5;
FL5 and FR5 are input to the pyramid pooling module.
4. The method of claim 3, wherein the workflow of the pyramid pooling module comprises:
adaptive average pooling operations with output sizes of 64×64, 32×32, 16×16, and 8×8 are applied to FL5 and FR5 respectively, producing four feature maps of different resolutions for each; the four feature maps are each reduced in dimensionality by a 1×1 convolution kernel and upsampled by bilinear interpolation, producing four feature maps of the same resolution;
the four feature maps corresponding to FL5 are concatenated with FL3 and FL5 to obtain the feature map FL6;
the four feature maps corresponding to FR5 are concatenated with FR3 and FR5 to obtain the feature map FR6;
FL6 and FR6 are input to the mixed attention mechanism module.
5. The method of claim 4, wherein the workflow of the mixed attention mechanism module comprises:
global max pooling and global average pooling over the spatial dimensions are applied to FL6 and FR6, giving two 1×1×C channel descriptors; both descriptors are fed through a two-layer neural network with ReLU activation, the two resulting C-dimensional features are added, and a Sigmoid activation yields the weight coefficient Ac; Ac is multiplied with FL6 and FR6 respectively to obtain intermediate features FL7 and FR7;
max pooling and average pooling along the channel dimension are applied to FL7 and FR7, giving two H×W×1 descriptors; these are concatenated along the channel dimension and passed through a 7×7 convolution layer with Sigmoid activation to obtain the weight coefficient As; As is multiplied with FL7 and FR7 respectively to obtain feature maps FL8 and FR8.
6. The method as recited in claim 5, comprising:
the feature maps FL8 and FR8 are concatenated along the channel dimension for each disparity level to obtain a four-dimensional cost volume C_disp(u, v, d, :);
bilinear interpolation and disparity-to-depth conversion are performed on C_disp(u, v, d, :) to obtain a depth cost volume C_depth(u, v, z, :);
the depth cost volume C_depth(u, v, z, :) is input to the stacked hourglass module;
wherein the disparity-to-depth conversion is expressed as:
Z(u, v) = f_U · B / D(u, v)    (1)
wherein f_U denotes the horizontal focal length; B denotes the baseline length; and D(u, v) and Z(u, v) denote the disparity and depth of the feature map at position (u, v), respectively.
7. The method of claim 6, wherein the stacked hourglass module comprises three hourglass networks, each of which processes the depth cost volume C_depth(u, v, z, :); bilinear interpolation is applied to the three resulting outputs to obtain feature maps S_depth1, S_depth2, and S_depth3, each of size Z×H×W;
when training the depth estimation network framework, S_depth1, S_depth2, and S_depth3 are all taken as initial prediction results S; when validating the depth estimation network framework, the feature map S_depth3 output by the last hourglass network is taken as the initial prediction result S;
depth regression is performed on S to obtain an initial depth map DM of size H×W;
the initial depth map DM is input to the edge guiding module;
wherein the depth regression maps the Z×H×W volume S to a per-pixel depth value by aggregating over the depth dimension.
8. The method of claim 7, wherein the workflow of the edge guiding module comprises:
edge information of the preprocessed image IML is jointly extracted by the edge detection operators to obtain an edge density E(u, v); at the same time, edge information of the initial depth map DM is extracted by the edge detection operators to obtain an edge density e(u, v);
an edge loss function L_edge is constructed from E(u, v) and e(u, v);
a depth loss function L_depth corresponding to the initial depth map DM is constructed;
finally, the joint loss function L_total is obtained, expressed as:
L_total = α·L_edge + β·L_depth    (7)
wherein α and β denote the balance coefficients of the corresponding loss terms;
joint supervision with the joint loss function L_total yields a final predicted depth map with sharp edges.
9. The method of claim 8, wherein training the depth estimation network framework in step S2 comprises:
inputting image_2 and image_3 from the training and validation sets of the KITTI dataset into a pre-trained GAN network to perform style transfer from color images to infrared images;
constructing a dataset from the binocular infrared image files and their corresponding calib.txt and velodyne.bin files;
training with the Adam optimizer, with an initial learning rate of 1e-4 that is automatically decayed during training, and β1 = 0.9, β2 = 0.999;
after each iteration, computing the training loss and validation loss, comparing the validation losses of the iterations, and saving the model parameters with the minimum validation loss, the model corresponding to these parameters being the first depth estimation network.
10. The method of claim 9, wherein, in step S3, validating the first depth estimation network comprises:
inputting the binocular infrared images in the validation set into the first depth estimation network to obtain a predicted depth map Z_p;
converting the 3D coordinates of the radar point cloud into pixel-plane coordinates, retaining the depth information z, and calculating the accuracy of the depth estimation, expressed as:
z = Z(u, v);    (8)
u = f_U · x / z + c_U;    (9)
v = f_V · y / z + c_V;    (10)
wherein x, y, z denote the spatial coordinate components of the radar point cloud; c_U and c_V denote the coordinate components of the camera principal point; and f_U and f_V denote the horizontal and vertical focal lengths of the camera, respectively.
CN202211588573.1A 2022-12-09 2022-12-09 Binocular infrared depth estimation method based on edge guiding and attention mechanism Active CN116128946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211588573.1A CN116128946B (en) 2022-12-09 2022-12-09 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211588573.1A CN116128946B (en) 2022-12-09 2022-12-09 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Publications (2)

Publication Number Publication Date
CN116128946A (en) 2023-05-16
CN116128946B (en) 2024-02-09

Family

ID=86296430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211588573.1A Active CN116128946B (en) 2022-12-09 2022-12-09 Binocular infrared depth estimation method based on edge guiding and attention mechanism

Country Status (1)

Country Link
CN (1) CN116128946B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220215567A1 (en) * 2019-05-10 2022-07-07 Nippon Telegraph And Telephone Corporation Depth estimation device, depth estimation model learning device, depth estimation method, depth estimation model learning method, and depth estimation program
CN111833386A (en) * 2020-07-22 2020-10-27 中国石油大学(华东) Pyramid binocular stereo matching method based on multi-scale information and attention mechanism
CN112150518A (en) * 2020-08-06 2020-12-29 江苏大学 Attention mechanism-based image stereo matching method and binocular device
CN112581517A (en) * 2020-12-16 2021-03-30 电子科技大学中山学院 Binocular stereo matching device and method
CN113763446A (en) * 2021-08-17 2021-12-07 沈阳工业大学 Stereo matching method based on guide information
CN114119704A (en) * 2021-12-02 2022-03-01 吉林大学 Light field image depth estimation method based on spatial pyramid pooling
CN114926669A (en) * 2022-05-17 2022-08-19 南京理工大学 Efficient speckle matching method based on deep learning
CN115049739A (en) * 2022-06-14 2022-09-13 贵州大学 Binocular vision stereo matching method based on edge detection
CN115170921A (en) * 2022-07-07 2022-10-11 广西师范大学 Binocular stereo matching method based on bilateral grid learning and edge loss
CN115170638A (en) * 2022-07-13 2022-10-11 东北林业大学 Binocular vision stereo matching network system and construction method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAO SONG ET AL.: "EdgeStereo: An Effective Multi-Task Learning Network for Stereo Matching and Edge Detection", 《ARXIV:1903.01700V2》, pages 1 - 18 *
XIAOWEI YANG ET AL.: "Edge supervision and multi-scale cost volume for stereo matching", 《IMAGE AND VISION COMPUTING》, pages 3 - 4 *

Also Published As

Publication number Publication date
CN116128946B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
EP3510561B1 (en) Predicting depth from image data using a statistical model
CN109101975B (en) Image semantic segmentation method based on full convolution neural network
CN111259945B (en) Binocular parallax estimation method introducing attention map
WO2021013334A1 (en) Depth maps prediction system and training method for such a system
CN111695633B (en) Low-illumination target detection method based on RPF-CAM
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN111815665B (en) Single image crowd counting method based on depth information and scale perception information
CN109509156B (en) Image defogging processing method based on generation countermeasure model
CN110443775B (en) Discrete wavelet transform domain multi-focus image fusion method based on convolutional neural network
CN111861939B (en) Single image defogging method based on unsupervised learning
CN111553940B (en) Depth image edge optimization method and processing device
CN113222033A (en) Monocular image estimation method based on multi-classification regression model and self-attention mechanism
CN110751157B (en) Image significance segmentation and image significance model training method and device
Hegde et al. Adaptive cubic spline interpolation in cielab color space for underwater image enhancement
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN112509021A (en) Parallax optimization method based on attention mechanism
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN116128946B (en) Binocular infrared depth estimation method based on edge guiding and attention mechanism
CN110766609B (en) Depth-of-field map super-resolution reconstruction method for ToF camera
Yang et al. Image defogging based on amended dark channel prior and 4‐directional L1 regularisation
Novikov et al. Local-adaptive blocks-based predictor for lossless image compression
CN115631223A (en) Multi-view stereo reconstruction method based on self-adaptive learning and aggregation
Shuang et al. Algorithms for improving the quality of underwater optical images: A comprehensive review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant