CN116630387A - Monocular image depth estimation method based on attention mechanism - Google Patents


Info

Publication number
CN116630387A
CN116630387A (application CN202310735294.1A)
Authority
CN
China
Prior art keywords
layer
module
attention
network
image
Prior art date
Legal status
Pending
Application number
CN202310735294.1A
Other languages
Chinese (zh)
Inventor
韩冰
熊燕南
施道典
高新波
杨铮
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310735294.1A priority Critical patent/CN116630387A/en
Publication of CN116630387A publication Critical patent/CN116630387A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20084: Artificial neural networks [ANN]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular image depth estimation method based on an attention mechanism, which mainly addresses the low depth estimation accuracy of the prior art in image regions with poor illumination and small pixel depth variations. The implementation scheme is as follows: read data from a monocular image depth estimation dataset and preprocess it; extract features from the preprocessed data using a Swin Transformer network as the encoder network; construct an aggregation structure to enhance the global information of the extracted features; construct a decoder network based on an attention mechanism to optimally decode the output features of the encoder and the aggregation structure, obtaining the decoder output features; construct a depth prediction network and predict the image depth using the decoder output features. The invention significantly improves the accuracy of monocular image depth estimation, achieves a better depth estimation effect in image regions with poor illumination and small pixel depth variations, and can be used for autonomous driving, robotics and three-dimensional reconstruction.

Description

Monocular image depth estimation method based on attention mechanism
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a monocular image depth estimation method which can be used for autonomous driving, robotics and three-dimensional reconstruction.
Background
Monocular image depth estimation is one of the basic tasks in the field of computer vision. It is widely applied in fields such as autonomous driving, robotics and three-dimensional reconstruction, and thus has extremely high application value. The task of monocular image depth estimation is to predict a pixel-by-pixel depth value for an RGB image.
In recent years, the problem of monocular image depth estimation has received extensive attention. Existing algorithms mainly adopt encoder-decoder structures, in which the encoder extracts image features and the decoder iteratively optimizes the features extracted by the encoder to output the final prediction. For the design of the codec, most methods employ convolutional neural networks; recently, many methods have utilized Transformer structures to enhance the feature extraction and processing capabilities of the network, and others have utilized conditional random fields to predict the image energy function.
Compared with traditional methods, convolutional neural networks have strong image feature extraction capability. Eigen et al. first applied convolutional neural networks to the monocular depth estimation task. On this basis, numerous works implementing and improving monocular depth estimation with convolutional neural networks have emerged. The multi-scale method BTS proposes a local planar guidance layer that aims to fuse features from all layers in the decoding process. The patch-wise attention network PWA designs a patch-based attention mechanism that focuses on each local region.
Although convolutional neural network-based approaches remain very popular, they have drawbacks. The receptive field of a convolutional neural network is local and cannot model the global information of an image, which causes monocular image depth estimation methods based on convolutional neural networks to hit performance bottlenecks. Because Transformers have a larger receptive field than convolutional neural networks and can model long-range dependencies, they have attracted increasing interest in computer vision tasks, and many works apply Transformers to monocular depth estimation. The adaptive interval method AdaBins predicts adaptive intervals with a miniature vision Transformer structure, miniViT. The neural window conditional random field method NeWCRFs exploits the powerful feature extraction capability of the shifted-window Transformer, Swin Transformer. The interval construction method BinsFormer uses a Transformer as its decoder. Although these Transformer-based methods achieve better monocular depth estimation performance, they mainly treat monocular depth estimation as a regression task, which leads to slow convergence and sub-optimal solutions.
To alleviate this problem, another branch of research regards monocular depth estimation as a classification task. The ordinal regression network DORN was the first to treat monocular depth estimation as an ordinal classification regression task, and designed an effective ordinal classification regression loss function for depth estimation. However, because the ordinal classification regression method proposed by DORN discretizes the depth values, it produces obvious depth discontinuities in the predicted depth map, which harms the visual effect. To address this problem, AdaBins further casts monocular depth estimation as a classification regression task, effectively alleviating the unsmooth depth transitions by linearly combining the interval center points. The pixel construction method PixelFormer proposes a lightweight interval generation module to reduce model complexity. Although these methods have made great progress in depth estimation accuracy, they lack modeling of the long-range dependencies among image pixels and attention to specific regions of the image, which results in inaccurate depth predictions in image regions with poor illumination and small depth variations.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a monocular image depth estimation method based on an attention mechanism, so as to improve the accuracy of pixel depth value prediction in tiny image regions with poor illumination and small depth variations and obtain better monocular image depth estimation performance.
To achieve the above purpose, the technical idea of the invention is as follows: enhance the global information of the features through a global information aggregation structure; establish correlations between pixels by designing self-attention network branches in the decoder; improve the network's attention to specific image regions by designing a regional attention network branch in the decoder; and improve the convergence speed and optimization effect of the network by designing a classification regression prediction structure. Through these designs, the invention improves the accuracy of pixel depth prediction in tiny image regions with poor illumination and depth variation, and obtains better monocular image depth estimation performance.
According to the above thought, the technical scheme of the invention comprises the following steps:
1. The monocular image depth estimation method based on the attention mechanism is characterized by comprising the following steps:
(1) Reading training data and test data from a monocular image depth estimation dataset, and sequentially performing rotation, scaling, flipping, adjustment and normalization preprocessing on the training images to obtain training tensor data; sequentially performing adjustment and normalization preprocessing on the test images to obtain test tensor data;
(2) Using a Swin Transformer network comprising 4 cascaded Swin Transformer stage modules as the encoder network, inputting the training tensor data and test tensor data into the Swin Transformer network respectively to obtain the training image features E1, E2, E3 and E4 and the test image features E1', E2', E3' and E4' output by the 4 cascaded modules;
(3) Enhancing the global information of the features using the aggregation structure:
3a) Four average pooling layers with pooling scales of 1, 2, 3 and 6 are connected in parallel to form a pyramid pooling module; the training image feature E4 output by the top-layer encoder is input to the pyramid pooling module to extract multi-scale information, the multi-scale information is concatenated with the input feature E4, and a new feature with global information is generated through a convolution layer;
3b) The new feature with global information is optimized by the existing convolution self-attention module to obtain the optimized global information feature X4;
(4) Constructing a decoder network based on an attention mechanism:
4a) Establishing a window-based self-attention module consisting of a window self-attention sub-module, a shifted-window self-attention sub-module and a Pixel Shuffle layer in cascade;
4b) Establishing a regional attention module formed by connecting an average pooling layer and a maximum pooling layer in parallel, cascaded with a convolution layer, a sigmoid layer and a transposed convolution layer;
4c) Connecting the window-based self-attention module in parallel with the regional attention module to form a decoder module;
4d) Cascading 4 decoder modules to form the attention-based decoder network;
(5) Using the attention-based decoder network, performing layer-by-layer optimized decoding on the features E1, E2, E3 and E4 output by the encoder network and the global information feature X4 output by the aggregation structure, sequentially obtaining features X3, X2, X1 and X0, with X0 as the final output feature;
(6) Constructing a depth map prediction network:
6a) Establishing an adaptive interval center prediction module consisting of a convolution layer followed by two parallel pooling layers, an average pooling layer and a maximum pooling layer, with learnable parameters, used to adaptively predict the interval centers of the depth values of the input image;
6b) Establishing a probability head module consisting of a convolution layer and a softmax layer, used to predict the probability vector corresponding to the depth value interval centers of the input image;
6c) Connecting the adaptive interval center prediction module and the probability head module in parallel to form the depth map prediction network;
(7) Predicting the depth map corresponding to the input image:
7a) Inputting the output feature X0 of the decoder network into the adaptive interval center prediction module and the probability head module of the depth map prediction network respectively, outputting the adaptive interval centers c(b) of the image depth values and the probability vector v;
7b) Linearly combining the adaptive interval centers c(b) and the probability vector v to obtain a preliminary depth map, and restoring the preliminary depth map to the input image size through an upsampling operation to obtain the final depth map.
Compared with the prior art, the invention has the following advantages:
1) Because the global information of the features is enhanced by the global information aggregation structure, the invention improves the accuracy of the network's image depth estimation;
2) The attention-based decoder network constructed by the invention makes it easy to establish correlations among image pixels and improves the network's attention to specific image regions, thereby improving the accuracy of pixel depth prediction in tiny regions with poor illumination and small depth variations;
3) By constructing a depth map prediction network based on an adaptive classification regression mode, the invention alleviates the slow convergence and sub-optimization problems of regression prediction and improves the accuracy of the network's image depth estimation.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of the structure of the present invention;
FIG. 3 is a schematic diagram of a decoder module according to the present invention;
FIG. 4 is a schematic diagram of a depth map prediction network according to the present invention;
FIG. 5 is a graph of the results of depth estimation on KITTI dataset images using the present invention and two existing methods, respectively;
FIG. 6 is a graph of the results of depth estimation on NYUV2 dataset images using the present invention and two existing methods, respectively.
Detailed Description
Embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of the present embodiment are as follows:
step 1, training data and test data are obtained and preprocessed.
1.1 Training data and test data are read from a monocular image depth estimation dataset, which in this example contains two different datasets, a KITTI dataset and a nrv 2 dataset, respectively;
1.2 The image of the training data is randomly rotated by minus 1 degrees and 1 degree, then the image size is scaled according to different types of data sets, namely the size of the training data image is scaled to 352 multiplied by 1120 for the KITTI data set, the size of the training data image is scaled to 480 multiplied by 640 for the NYUV2 data set, the image is randomly turned over at a probability of 0.5, the brightness, the contrast, the saturation and the tone of the image are randomly adjusted at a probability of 0.5, and the normalization is carried out according to the following formula:
wherein x is scale Is the normalized pixel value of the corresponding channel, x is the pixel value of a single channel of the RGB image, S is the standard deviation of the pixel value of the corresponding channel, mu is the mean value of the pixel value of the corresponding channel, and the normalized mean values of the three channels of RGB are respectively [0.485,0.456,0.406 ]]The standard deviation of the three channels is [0.229,0.224,0.225 ]];
1.3 The image of the test data is scaled to the different types of data sets, i.e. the test data image size is scaled to 352 x 1216 for the KITTI data set and to 480 x 640 for the nryuv 2 data set, which is then normalized as follows:
wherein x is scale Is the normalizationThe pixel value of the corresponding channel after the conversion, x is the pixel value of a single channel of the RGB image, S is the standard deviation of the pixel value of the corresponding channel, mu is the average value of the pixel value of the corresponding channel, and the normalized average value of the three channels of RGB is respectively [0.485,0.456,0.406 ]]The standard deviation of the three channels is [0.229,0.224,0.225 ]];
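To make the preprocessing concrete, the following is a minimal sketch of the training-time pipeline of 1.2), assuming torchvision-style transforms; the color-jitter magnitudes are illustrative assumptions, since the text only fixes the 0.5 probability.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

MEAN = [0.485, 0.456, 0.406]   # per-channel normalization means from 1.2)
STD = [0.229, 0.224, 0.225]    # per-channel standard deviations from 1.2)

def preprocess_train(img, dataset="kitti"):
    img = TF.rotate(img, random.uniform(-1.0, 1.0))           # rotate within [-1, 1] degrees
    size = (352, 1120) if dataset == "kitti" else (480, 640)  # dataset-dependent resize
    img = TF.resize(img, size)
    if random.random() < 0.5:                                 # random horizontal flip
        img = TF.hflip(img)
    if random.random() < 0.5:                                 # random color adjustment
        img = T.ColorJitter(brightness=0.2, contrast=0.2,
                            saturation=0.2, hue=0.1)(img)
    img = TF.to_tensor(img)
    return TF.normalize(img, MEAN, STD)                       # x_scale = (x - mu) / S
```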
Step 2, extracting the features of the image using a Swin Transformer network as the encoder network.
As shown in Fig. 2, the Swin Transformer network is an existing network comprising 4 cascaded Swin Transformer stage modules, each stage module comprising a downsampling layer and a plurality of repeated Swin Transformer Blocks; this example uses the Swin-L version of the Swin Transformer as the encoder network.
The training data and test data preprocessed in step 1 are input into the Swin Transformer network respectively, and the 4 cascaded modules of the network output the 4 features E1, E2, E3, E4 of the training data and the 4 features E1', E2', E3', E4' of the test data.
The encoder network structure and the extracted image features are shown in Table 1:
Table 1: Swin Transformer network structure and extracted image features
H and W in Table 1 are the height and width of the input image, respectively.
Step 3, enhancing the global information of the features using the aggregation structure.
3.1) Four average pooling layers with pooling scales of 1, 2, 3 and 6 are connected in parallel to form a pyramid pooling module; the training image feature E4 output by the top-layer encoder is input to the pyramid pooling module to extract multi-scale information, the multi-scale information is concatenated with the input feature E4, and a new feature with global information is generated through a convolution layer;
3.2) The new feature with global information is optimized by the existing convolution self-attention module to obtain the optimized global information feature X4:
3.2.1) The new feature with global information is projected through three 1×1 convolutions, and the obtained intermediate features are processed by two parallel operations: one shifts the intermediate features to obtain shifted new features, and the other takes the intermediate features as the query, key and value matrices and computes self-attention features;
3.2.2) The shifted new features and the self-attention features are added to obtain the optimized global information feature X4.
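The following PyTorch sketch shows one plausible layout of the pyramid pooling step in 3.1), following the usual PSPNet-style construction; the class name and the branch channel count of 256 are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, branch_ch=256):
        super().__init__()
        self.scales = (1, 2, 3, 6)                     # pooling scales from 3.1)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, 1) for _ in self.scales)
        # Convolution fusing E4 and the four pooled branches into the
        # new feature with global information.
        self.fuse = nn.Conv2d(in_ch + 4 * branch_ch, in_ch, 3, padding=1)

    def forward(self, e4):
        h, w = e4.shape[-2:]
        outs = [e4]
        for s, conv in zip(self.scales, self.branches):
            p = F.adaptive_avg_pool2d(e4, s)           # pool E4 down to s x s
            p = F.interpolate(conv(p), (h, w),
                              mode="bilinear", align_corners=False)
            outs.append(p)                             # multi-scale information
        return self.fuse(torch.cat(outs, dim=1))       # concatenate with E4, convolve
```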
Step 4, constructing a decoder network based on an attention mechanism.
Referring to Fig. 3, this step is implemented as follows:
4.1) Establishing a window-based self-attention module:
4.1.1) Establishing a window self-attention sub-module, which sequentially comprises two parallel convolution layers with kernel size 3×3 and stride 1, a regularization layer, a window partition layer, a multi-layer perceptron, a window-based multi-head self-attention layer and a multi-layer perceptron:
the convolution layers are used to change the number of channels of the input features;
the regularization layer is used to regularize the input features;
the window partition layer is used to divide the input features into 7×7 windows;
the first multi-layer perceptron is used to generate the query matrix Q, key matrix K and value matrix V;
the window-based multi-head self-attention layer computes multi-head self-attention using the query matrix Q, key matrix K and value matrix V;
the second multi-layer perceptron is used to increase the nonlinear fitting capacity of the network;
4.1.2) Establishing a shifted-window self-attention sub-module with the same structure as the window self-attention sub-module, except that in the window partition layer the position of each 7×7 window is shifted down and to the right by 3×3 positions;
4.1.3) Establishing a Pixel Shuffle layer, which comprises the existing Pixel Shuffle operation and is used to adjust the channel number and resolution of the features;
4.1.4) Cascading the window self-attention sub-module, the shifted-window self-attention sub-module and the Pixel Shuffle layer to form the window-based self-attention module;
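As a sketch of the window partition used by the two sub-modules above, the helper below splits a feature map into 7×7 windows, with the 3×3 shift of 4.1.2) realized as a cyclic roll; treating the shift as torch.roll follows the common Swin implementation and is an assumption here. Self-attention is then computed independently within each returned sequence.

```python
import torch

def window_partition(x, window=7, shift=0):
    """x: (B, H, W, C) feature map -> (num_windows * B, window * window, C)."""
    if shift:
        # Shifted variant: roll the map so each window moves down and right.
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    return x  # each 7x7 window becomes one self-attention sequence
```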
4.2 Building a zone attention module):
4.2.1 Establishing an average pooling layer consisting of an average pooling operation for reducing the feature map size;
4.2.2 A maximum pooling layer formed by the maximum pooling operation is established for reducing the size of the feature map;
4.2.3 A convolution layer consisting of convolution operations with a kernel size of 7 x 7 is established for compressing the output characteristics of the average pooling layer and the maximum pooling layer in parallel;
4.2.4 Building a sigmoid layer formed by sigmoid operation, which is used for generating weights corresponding to different pixels for the characteristics of the convolution layer, and then carrying out dot product on the weights and the input characteristics of the regional attention module to obtain the output characteristics with regional attention;
4.2.5 A transpose convolution layer consisting of transpose convolution operations is built for adjusting the channel number and resolution of the feature;
4.2.6 The average pooling layer is connected with the maximum pooling layer in parallel, and then is connected with the convolution layer, the sigmoid layer and the transposed convolution layer in a hierarchical mode to form the regional attention module.
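A hedged PyTorch sketch of this regional attention branch is given below. The text fixes only the layer order, so the pooling stride of 2 and the bilinear interpolation that brings the pooled attention map back to the input resolution before the dot product are interpretations, not given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionalAttention(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.avg = nn.AvgPool2d(2)                          # parallel pooling branches
        self.max = nn.MaxPool2d(2)                          # that reduce the map size
        self.conv = nn.Conv2d(2 * in_ch, 1, 7, padding=3)   # 7x7 compression conv
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)  # channel/size change

    def forward(self, x):
        w = self.conv(torch.cat([self.avg(x), self.max(x)], dim=1))
        w = F.interpolate(w, x.shape[-2:], mode="bilinear", align_corners=False)
        w = torch.sigmoid(w)          # per-pixel weights in (0, 1)
        return self.up(x * w)         # weight the input, then transposed conv
```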
4.3 -connecting the window-based self-attention module in parallel with the regional attention module to form a decoder module;
4.4 4 decoder modules are cascaded to form an attention-based decoder network.
Step 5, using the attention-based decoder network to perform layer-by-layer optimized decoding on the features E1, E2, E3 and E4 output by the encoder network and the global information feature X4 output by the aggregation structure, obtaining the decoder network output feature X0.
5.1) The output feature E4 of the layer-4 encoder module and the output feature X4 of the aggregation structure are input to the layer-4 decoder module of the decoder network, obtaining the output feature X3 of the layer-4 decoder module;
5.2) The output feature E3 of the layer-3 encoder module and the output feature X3 of the layer-4 decoder module are input to the layer-3 decoder module of the decoder network, obtaining the output feature X2 of the layer-3 decoder module;
5.3) The output feature E2 of the layer-2 encoder module and the output feature X2 of the layer-3 decoder module are input to the layer-2 decoder module of the decoder network, obtaining the output feature X1 of the layer-2 decoder module;
5.4) The output feature E1 of the layer-1 encoder module and the output feature X1 of the layer-2 decoder module are input to the layer-1 decoder module of the decoder network, obtaining the output feature X0 of the layer-1 decoder module, i.e. the output feature of the whole decoder network.
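The layer-by-layer decoding of 5.1)-5.4) reduces to a short loop, as sketched below; the decoder modules stand for the parallel window-attention/regional-attention blocks of step 4 and are assumed here to be callables taking the pair (Ei, Xi).

```python
def decode(encoder_feats, x4, decoder_modules):
    """encoder_feats = [E1, E2, E3, E4]; decoder_modules = [D1, D2, D3, D4]."""
    x = x4
    for e, d in zip(reversed(encoder_feats), reversed(decoder_modules)):
        x = d(e, x)   # D4(E4, X4) -> X3, D3(E3, X3) -> X2, ..., D1(E1, X1) -> X0
    return x          # X0, the output feature of the whole decoder network
```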
Step 6, constructing a depth map prediction network.
Referring to Fig. 4, this step is implemented as follows:
6.1) Establishing an adaptive interval center prediction module consisting of a convolution layer followed by two parallel pooling layers, an average pooling layer and a maximum pooling layer, with learnable parameters, used to adaptively predict the interval centers of the depth values of the input image;
6.2) Establishing a probability head module consisting of a convolution layer and a softmax layer, used to predict the probability vector corresponding to the depth value interval centers of the input image;
6.3) Connecting the adaptive interval center prediction module and the probability head module in parallel to form the depth map prediction network.
and 7, predicting a depth map depth corresponding to the input image by using a depth map prediction network.
7.1 A section center c (b) of the predicted input image depth value:
7.1.1 Outputting the characteristics X from the decoder network 0 Input to a convolution layer of kernel size 1X 1, X 0 The number of channels is increased to 256;
7.1.2 Feature X after enlarging the number of channels 0 Two corresponding tensors are obtained through two parallel average pooling layers and a maximum pooling layer respectively, and two learnable parameters rho are introduced 1 And ρ 2 Multiplying the two tensors respectively, and adding the two tensors to obtain the interval width of the depth value of the input imageTo capture global information and extract salient information simultaneously;
7.1.3 Using the interval width b, calculating the interval center of the depth value of the input image
Wherein d min And d max Respectively representing the minimum value and the maximum value of the depth value;
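Since the center formula above follows the AdaBins-style convention, the computation of 7.1.3) could be sketched as follows; normalizing the predicted widths so they sum to one is part of that convention and is an assumption here, as is the default d_max of 10 m (the NYUV2 depth range).

```python
import torch

def bin_centers(b, d_min=1e-3, d_max=10.0):
    """b: (B, N) predicted interval widths -> (B, N) depth interval centers."""
    b = torch.relu(b) + 1e-6                  # keep widths strictly positive
    b = b / b.sum(dim=1, keepdim=True)        # normalize widths to sum to 1
    edges = torch.cumsum(b, dim=1)            # right edge of each interval
    centers = edges - 0.5 * b                 # midpoint: b_i/2 + sum_{j<i} b_j
    return d_min + (d_max - d_min) * centers  # map to the [d_min, d_max] range
```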
7.2 A probability vector v corresponding to the input image depth value interval center c (b):
7.2.1 Outputting the characteristics X from the decoder network 0 Input to a convolution layer of core size 3X 3, step size and fill 1, output channel number 256 to vary X 0 The number of channels;
7.2.2 Feature X after changing the number of channels 0 Generating a probability vector v corresponding to the center of the depth value interval of the input image through a softmax layer;
7.3 Predicting a depth map depth corresponding to the input image:
and linearly combining the self-adaptive interval center c (b) and the probability vector v to obtain a preliminary depth map, and restoring the preliminary depth map to the size of the input image through up-sampling operation to obtain a final depth map depth.
The effects of the present invention are further described below in connection with simulation experiments.
1. Experimental conditions:
the computer processor is Intel (R) Xeon (R) Gold 6148CPU@2.40GHz, the running memory 128G, and the display card is NVIDIA GeForce RTX 3090GPU with a display memory of 24 GB.
The operating system is 64 bits Ubuntu 18.04 (LTS) and the deep learning framework used is PyTorch (version 1.10.1). All network training adopts a back propagation algorithm to calculate residuals of each layer, and uses an Adam optimizer to update network parameters, wherein parameters of the Adam optimizer are beta 1=0.9, beta 2=0.999, and a weight attenuation term is 0.01. 20 epochs are set, the initial learning rate is 1e-5, and the initial learning rate is linearly reduced to 1e-6 in the training iteration process. The network is trained using the SILog penalty.
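For reference, the SILog loss mentioned above is commonly implemented as below; the variance-focus weight λ = 0.85 and the scale factor α = 10 are the values typically used with this loss (e.g. in BTS and NeWCRFs) and are assumptions here, since the text does not state them.

```python
import torch

def silog_loss(pred, target, lam=0.85, alpha=10.0, eps=1e-6):
    valid = target > eps                                  # supervise valid pixels only
    g = torch.log(pred[valid] + eps) - torch.log(target[valid])
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)
```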
The experiments are evaluated on the KITTI dataset and the NYUV2 dataset, using error metrics including the average absolute relative error AbsRel, average squared relative error SqRel, root mean square error RMSE, root mean square logarithmic error RMSElog, and average logarithmic error log10. At the same time, the accuracy under the thresholds δ < 1.25^i, i = 1, 2, 3, is used to evaluate the model. Lower values are better for the error metrics, and higher values are better for the accuracy metrics.
The KITTI dataset is the most popular depth estimation benchmark; it captures outdoor scenes with sensor devices mounted on moving vehicles. The dataset includes stereoscopic images and corresponding LIDAR scans. Following the split of Eigen, 23158 images are used for training and 697 images for testing. During training and testing, the input image is cropped and the output is upsampled to the label size. Furthermore, the upper limit of depth prediction is limited to 80 meters.
The NYUV2 dataset is an indoor dataset comprising 120K RGB and depth pairs collected in 464 indoor scenes. Training and testing follow the official split, with 36253 images from 249 scenes for training and 654 images from 215 scenes for testing. During training and testing, the input image is cropped and the upper limit of depth prediction is limited to 10 meters.
2. The experimental contents are as follows:
Experiment 1: under the above experimental conditions, monocular image depth estimation is performed on the KITTI dataset with the present invention and 11 existing image depth estimation methods. The objective evaluation results are shown in Table 2 and the subjective results in Fig. 5, where Fig. 5(a) is the input image, Fig. 5(b) is the depth map predicted by the existing method NeWCRFs, Fig. 5(c) is the depth map predicted by the existing method PixelFormer, Fig. 5(d) is the depth map predicted by the present invention, and Fig. 5(e) is the ground-truth depth map corresponding to the input image.
Table 2: results of image depth estimation on the test set of the KITTI dataset by the present invention and 11 existing methods.
The 11 existing methods in Table 2 are:
Eigen et al.: the first monocular image depth estimation algorithm based on convolutional neural networks, proposed by Eigen et al.;
DORN: the first monocular image depth estimation algorithm based on ordinal classification, proposed by Fu et al.;
Yin et al.: a monocular image depth estimation algorithm based on virtual normal geometric constraints, proposed by Yin et al.;
BTS: a monocular image depth estimation algorithm based on local planar guidance layers, proposed by Lee et al.;
PWA: a monocular image depth estimation algorithm based on a patch-wise attention mechanism, proposed by Lee et al.;
AdaBins: a monocular image depth estimation algorithm based on adaptive depth intervals, proposed by Farooq et al.;
P3Depth: a monocular depth estimation algorithm based on a piecewise planarity prior, proposed by Patil et al.;
Depthformer: a monocular image depth estimation algorithm based on local-global information fusion, proposed by Agarwal et al.;
BinsFormer: a Transformer-based monocular image depth estimation algorithm, proposed by Li et al.;
NeWCRFs: a monocular image depth estimation algorithm based on conditional random fields, proposed by Yuan et al.;
PixelFormer: a monocular image depth estimation algorithm based on pixel query optimization, proposed by Agarwal et al.
As can be seen from Table 2, compared with the 11 existing methods, the present invention achieves optimal or suboptimal results on all objective evaluation metrics.
As can be seen from Fig. 5, compared with the existing comparison methods, the present invention achieves better depth estimation results and produces clearer contours in image regions with poor illumination or small depth variations, such as walls, windows and tree crowns.
Experiment 2: under the above experimental conditions, monocular image depth estimation is performed on the NYUV2 dataset with the present invention and 12 existing image depth estimation methods. The objective evaluation results are shown in Table 3 and the subjective results in Fig. 6, where Fig. 6(a) is the input image, Fig. 6(b) is the depth map predicted by the existing method NeWCRFs, Fig. 6(c) is the depth map predicted by the existing method PixelFormer, Fig. 6(d) is the depth map predicted by the present invention, and Fig. 6(e) is the ground-truth depth map corresponding to the input image.
Table 3: results of image depth estimation on the test set of the NYUV2 dataset by the present invention and 12 existing methods.
The existing method DAV in Table 3 is a monocular image depth estimation algorithm based on depth-attention volumes proposed by Huynh et al.; the other 11 existing methods in Table 3 are the same as those in Table 2.
As can be seen from Table 3, compared with the 12 existing methods, the present invention achieves optimal or suboptimal results on all objective evaluation metrics.
As can be seen from Fig. 6, compared with the existing comparison methods, the present invention achieves better depth estimation results and produces clearer contours in image regions with poor illumination or small depth variations, such as windows, displays and keyboards.
Experiment 3: various new methods are obtained by deleting or replacing the modules proposed by the invention, and monocular image depth estimation is performed on the KITTI dataset with these methods. The objective evaluation results are shown in Table 4.
Table 4: ablation results of image depth estimation on the test set of the KITTI dataset by the various new methods.
In Table 4, "ABD" refers to the attention-based decoder network proposed by the invention, "CR" refers to the classification regression depth prediction network proposed by the invention, and "GIA" refers to the global information aggregation structure proposed by the invention. "Baseline" refers to the new method obtained from the invention after deleting "GIA" and replacing "ABD" and "CR". "Abs Rel", "Sq Rel", "RMSE" and "RMSE log" are error metrics, where "↓" indicates that lower values are better; "δ<1.25", "δ<1.25²" and "δ<1.25³" are accuracy metrics, where "↑" indicates that higher values are better.
Baseline: the first new method, which estimates image depth with an existing regression prediction network, where the encoder is a Swin Transformer network and the decoder is that of the existing UNet network;
Baseline+ABD: the second new method, formed by replacing the decoder of the Baseline method with the attention-based decoder network proposed by the invention;
Baseline+ABD+CR: the third new method, formed by replacing the existing regression prediction network of the Baseline+ABD method with the classification regression depth prediction network proposed by the invention;
Baseline+ABD+CR+GIA (the invention): formed by adding the global information aggregation structure proposed by the invention to the Baseline+ABD+CR method, i.e. the invention itself.
As can be seen from Table 4, compared with all the other methods, the invention has the lowest error metric values and the highest accuracy metric values for monocular image depth estimation on the KITTI dataset, i.e. the best image depth estimation performance. Moreover, as the attention-based decoder network, the classification regression depth prediction network and the global information aggregation structure proposed by the invention are successively added to or substituted into the other methods, the error metric values keep decreasing and the accuracy metric values keep increasing, i.e. the depth estimation performance keeps improving, which demonstrates the effectiveness of each module proposed by the invention.
In conclusion, the invention obtains the best monocular image depth estimation results on both the KITTI dataset and the NYUV2 dataset, which proves that it achieves the best subjective results and objective evaluation metrics on the monocular image depth estimation task.

Claims (9)

1. The monocular image depth estimation method based on the attention mechanism is characterized by comprising the following steps of:
(1) Reading training data and test data from a monocular image depth estimation database, and sequentially performing rotation, scaling, flipping, adjustment and normalization preprocessing on the training images to obtain the training data; sequentially performing scaling and normalization preprocessing on the test images to obtain the test data;
(2) Using a Swin Transformer network comprising 4 cascaded Swin Transformer stage modules as the encoder network, inputting the training data and test data into the Swin Transformer network respectively to obtain the training image features E1, E2, E3 and E4 and the test image features E1', E2', E3' and E4' output by the 4 cascaded modules;
(3) Enhancing the global information of the features using the aggregation structure:
3a) Four average pooling layers with pooling scales of 1, 2, 3 and 6 are connected in parallel to form a pyramid pooling module; the training image feature E4 output by the top-layer encoder is input to the pyramid pooling module to extract multi-scale information, the multi-scale information is concatenated with the input feature E4, and a new feature with global information is generated through a convolution layer;
3b) The new feature with global information is optimized by the existing convolution self-attention module to obtain the optimized global information feature X4;
(4) Constructing a decoder network based on an attention mechanism:
4a) Establishing a window-based self-attention module consisting of a window self-attention sub-module, a shifted-window self-attention sub-module and a Pixel Shuffle layer in cascade;
4b) Establishing a regional attention module formed by connecting an average pooling layer and a maximum pooling layer in parallel, cascaded with a convolution layer, a sigmoid layer and a transposed convolution layer;
4c) Connecting the window-based self-attention module in parallel with the regional attention module to form a decoder module;
4d) Cascading 4 decoder modules to form the attention-based decoder network;
(5) Using the attention-based decoder network, performing layer-by-layer optimized decoding on the features E1, E2, E3 and E4 output by the encoder network and the global information feature X4 output by the aggregation structure, sequentially obtaining features X3, X2, X1 and X0, with X0 as the final output feature;
(6) Constructing a depth map prediction network:
6a) Establishing an adaptive interval center prediction module consisting of a convolution layer followed by two parallel pooling layers, an average pooling layer and a maximum pooling layer, with learnable parameters, used to adaptively predict the interval centers of the depth values of the input image;
6b) Establishing a probability head module consisting of a convolution layer and a softmax layer, used to predict the probability vector corresponding to the depth value interval centers of the input image;
6c) Connecting the adaptive interval center prediction module and the probability head module in parallel to form the depth map prediction network;
(7) Predicting the depth map corresponding to the input image:
7a) Inputting the output feature X0 of the decoder network into the adaptive interval center prediction module and the probability head module of the depth map prediction network respectively, outputting the adaptive interval centers c(b) of the image depth values and the probability vector v;
7b) Linearly combining the adaptive interval centers c(b) and the probability vector v to obtain a preliminary depth map, and restoring the preliminary depth map to the input image size through an upsampling operation to obtain the final depth map.
2. The method of claim 1, wherein the rotation, scaling, flipping, adjustment and normalization preprocessing of the training images in step (1) is to randomly rotate the input image within [-1°, 1°], scale the image size according to the dataset, randomly flip the image with probability 0.5, randomly adjust the brightness, contrast, saturation and hue of the image with probability 0.5, and normalize the image with the following formula:
x_scale = (x - μ) / S,
where x is the pixel value of a single channel of the RGB image, μ is the mean of the pixel values of the corresponding channel, S is the standard deviation of the pixel values of the corresponding channel, and x_scale is the normalized pixel value of the corresponding channel.
3. The method of claim 1, wherein the 4 Swin Transformer stage modules in step (2) are identical in structure, each comprising a downsampling layer and a plurality of repeated Swin Transformer Blocks.
4. The method of claim 1, wherein the feature optimization of the new feature with global information by the convolution self-attention module in step 3b) is implemented as follows:
the input feature map is projected through three 1×1 convolutions, and the obtained intermediate features are processed by two parallel operations: one shifts the intermediate features, and the other takes the intermediate features as the query, key and value matrices and computes self-attention;
the output features of the two parallel operations are added to obtain the optimized feature.
5. The method of claim 1, wherein the window self-attention sub-module, the shifted-window self-attention sub-module and the Pixel Shuffle layer in step 4a) are structured and function as follows:
the window self-attention sub-module sequentially comprises two parallel convolution layers with kernel size 3×3 and stride 1, a regularization layer, a window partition layer, a multi-layer perceptron, a window-based multi-head self-attention layer and a multi-layer perceptron, wherein:
the convolution layers are used to change the number of channels of the input features;
the regularization layer is used to regularize the input features;
the window partition layer is used to divide the input features into 7×7 windows;
the first multi-layer perceptron is used to generate the query matrix Q, key matrix K and value matrix V;
the window-based multi-head self-attention layer computes multi-head self-attention using the query matrix Q, key matrix K and value matrix V;
the second multi-layer perceptron is used to increase the nonlinear fitting capacity of the network;
the shifted-window self-attention sub-module has the same structure as the window self-attention sub-module, except that in the window partition layer the position of each 7×7 window is shifted down and to the right by 3×3 positions;
the Pixel Shuffle layer is the existing Pixel Shuffle operation, used to adjust the channel number and resolution of the features.
6. The method of claim 1, wherein the layers of the regional attention module constructed in step 4b) are structured and function as follows:
the average pooling layer is an average pooling operation, used to reduce the feature map size;
the maximum pooling layer is a maximum pooling operation, used to reduce the feature map size;
the convolution layer is a convolution operation with kernel size 7×7, used to compress the parallel output features of the average pooling layer and the maximum pooling layer;
the sigmoid layer is used to generate weights corresponding to different pixels from the convolution layer features; the weights are then dot-multiplied with the input features of the regional attention module to obtain output features with regional attention;
the transposed convolution layer is a transposed convolution operation, used to adjust the channel number and resolution of the features.
7. The method of claim 1, wherein the layer-by-layer optimized decoding of the features E1, E2, E3 and E4 output by the encoder network and the global information feature X4 output by the aggregation structure with the attention-based decoder network in step (5) is implemented as follows:
5a) The output feature E4 of the layer-4 encoder module and the output feature X4 of the aggregation structure are input to the layer-4 decoder module of the decoder network, obtaining the output feature X3 of the layer-4 decoder module;
5b) The output feature E3 of the layer-3 encoder module and the output feature X3 of the layer-4 decoder module are input to the layer-3 decoder module of the decoder network, obtaining the output feature X2 of the layer-3 decoder module;
5c) The output feature E2 of the layer-2 encoder module and the output feature X2 of the layer-3 decoder module are input to the layer-2 decoder module of the decoder network, obtaining the output feature X1 of the layer-2 decoder module;
5d) The output feature E1 of the layer-1 encoder module and the output feature X1 of the layer-2 decoder module are input to the layer-1 decoder module of the decoder network, obtaining the output feature X0 of the layer-1 decoder module, i.e. the output feature of the whole decoder network.
8. The method of claim 1, wherein the adaptive interval center prediction module in step 6a) adaptively predicts the interval centers of the input image depth values as follows:
6a1) The decoder network output feature X0 is input to a convolution layer with kernel size 1×1, used to expand the number of channels of X0 to 256;
6a2) The feature X0 with enlarged channel number passes through two parallel pooling layers, an average pooling layer and a maximum pooling layer, to obtain two corresponding tensors; two learnable parameters ρ1 and ρ2 are introduced and multiplied with the two tensors respectively, and the results are added to obtain the interval widths b of the input image depth values, capturing global information and extracting salient information simultaneously: b = ρ1 · Pavg(X0) + ρ2 · Pmax(X0);
6a3) Using the interval widths b, the interval centers of the input image depth values are computed: c(b_i) = d_min + (d_max - d_min)(b_i/2 + Σ_{j=1}^{i-1} b_j),
where d_min and d_max represent the minimum and maximum depth values, respectively.
9. The method of claim 1, wherein the probability head module in step 6b) predicts the probability vector corresponding to the input image depth value interval centers as follows:
6b1) The decoder network output feature X0 is input to a convolution layer with kernel size 3×3, stride and padding 1, and 256 output channels, used to change the number of channels of X0;
6b2) The feature X0 with changed channel number passes through a softmax layer to generate the probability vector v corresponding to the depth value interval centers of the input image.
CN202310735294.1A 2023-06-20 2023-06-20 Monocular image depth estimation method based on attention mechanism Pending CN116630387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310735294.1A CN116630387A (en) 2023-06-20 2023-06-20 Monocular image depth estimation method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310735294.1A CN116630387A (en) 2023-06-20 2023-06-20 Monocular image depth estimation method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN116630387A true CN116630387A (en) 2023-08-22

Family

ID=87638279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310735294.1A Pending CN116630387A (en) 2023-06-20 2023-06-20 Monocular image depth estimation method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN116630387A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593666A (en) * 2024-01-19 2024-02-23 南京航空航天大学 Geomagnetic station data prediction method and system for aurora image
CN117593666B (en) * 2024-01-19 2024-05-17 南京航空航天大学 Geomagnetic station data prediction method and system for aurora image

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN108596248B (en) Remote sensing image classification method based on improved deep convolutional neural network
CN114187331B (en) Unsupervised optical flow estimation method based on Transformer feature pyramid network
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
Zhang et al. Dense haze removal based on dynamic collaborative inference learning for remote sensing images
CN110599502A (en) Skin lesion segmentation method based on deep learning
CN116258914B (en) Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion
CN116630387A (en) Monocular image depth estimation method based on attention mechanism
CN116310325A (en) Large-breadth remote sensing image semantic segmentation method from PATCH to REGION architecture
CN116205962A (en) Monocular depth estimation method and system based on complete context information
Liu et al. Diverse hyperspectral remote sensing image synthesis with diffusion models
CN117291803B (en) PAMGAN lightweight facial super-resolution reconstruction method
CN109934283A (en) A kind of adaptive motion object detection method merging CNN and SIFT light stream
CN117765175A (en) Multi-view stereoscopic reconstruction system based on feature aggregation transducer
CN116434343B (en) Video motion recognition method based on high-low frequency double branches
CN117115474A (en) End-to-end single target tracking method based on multi-stage feature extraction
CN116644782A (en) Cross-filtering transducer structure, image semantic segmentation model and method
CN115115860A (en) Image feature point detection matching network based on deep learning
Wang et al. Sparse Transformer-based bins and Polarized Cross Attention decoder for monocular depth estimation
Yang et al. FATCNet: Feature Adaptive Transformer and CNN for Infrared Small Target Detection
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning
CN118014894B (en) Image restoration method, device, equipment and readable storage medium based on combination of edge priors and attention mechanisms
Li et al. GLAGAN image inpainting algorithm based on global and local consistency
Li et al. Infrared scene prediction of night unmanned vehicles based on multi-scale feature maps
Wei et al. Remote Sensing Image Registration Based on Grid Prediction of Pixel Deformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination