CN112132880A - Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image

Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image

Info

Publication number
CN112132880A
CN112132880A (application CN202010910048.1A)
Authority
CN
China
Prior art keywords
depth
network
real
sparse
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010910048.1A
Other languages
Chinese (zh)
Other versions
CN112132880B (en)
Inventor
潘树国
赵涛
高旺
魏建胜
盛超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010910048.1A priority Critical patent/CN112132880B/en
Publication of CN112132880A publication Critical patent/CN112132880A/en
Application granted granted Critical
Publication of CN112132880B publication Critical patent/CN112132880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis
    • G06T7/50 — Depth or shape recovery
    • G06T2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T2207/10 — Image acquisition modality
    • G06T2207/10028 — Range image; Depth image; 3D point clouds
    • G06T2207/20 — Special algorithmic details
    • G06T2207/20081 — Training; Learning
    • G06T2207/20084 — Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a real-time dense depth estimation method based on sparse measurement and monocular RGB images. The method adopts a self-attention mechanism and long and short dense skip connections to extract more useful information from sparse depth measurements, and, combined with a deep supervision technique, provides a lightweight network design method for real-time depth estimation. Experiments verify the effectiveness of the self-attention mechanism, the long and short dense skip connections and the deep supervision technique, and show that the method balances network prediction accuracy against inference speed to obtain maximum efficiency. With the depth estimated in real time by the method, at a sparse sampling rate below 1/10000 the depth error on the indoor dataset NYU-Depth-v2 is within 30 cm and the error on the outdoor dataset KITTI is within 4 m.

Description

Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image
Technical Field
The invention belongs to the technical field of robot visual positioning and navigation, and particularly relates to a real-time dense depth estimation method based on sparse measurement and monocular RGB images.
Background
Dense depth estimation plays an important role in fields such as unmanned aerial vehicles, intelligent navigation and augmented reality. Current mainstream depth acquisition solutions consist of a high-resolution camera and a low-resolution depth sensor; such solutions are generally expensive and cannot provide dense depth, so they are impractical for most applications. Furthermore, although more than a decade of research has been devoted to improving RGB-based depth estimation with deep learning methods, its accuracy and reliability remain far from practical. Therefore, high-precision, real-time dense depth estimation from a single image and sparse depth measurements acquired by a monocular camera and a low-resolution depth sensor is of great significance.
Compared with depth estimation from only an RGB or grayscale image, a major advantage of sparse-sample-based approaches is that the sparse depth measurements can be regarded as part of the output ground truth. However, most current sparse-sample-based depth estimation methods follow a network design similar to that of single-frame RGB methods, which leaves the sparse information under-utilized. To address this problem, the invention uses a self-attention mechanism and long and short dense skip connections to further improve the accuracy of sparse-sample-based depth estimation. In addition, past research on monocular depth estimation has focused almost exclusively on improving accuracy, so computation-intensive algorithms cannot easily be adopted in robotic systems: most systems, especially tiny devices, have limited computing and storage resources, and a key challenge is therefore to balance the runtime cost of the algorithm against its accuracy.
Disclosure of Invention
In order to solve the above problems, the invention discloses a real-time dense depth estimation method based on sparse measurement and monocular RGB images, which uses a self-attention mechanism, dense skip connections and deep supervision to improve the performance of the sparse-sample depth estimation task, balancing network prediction accuracy against inference speed to maximize efficiency.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a real-time dense depth estimation method based on sparse measurement and monocular RGB images comprises the following steps:
(1) extracting information from the sparse depth measurements with a self-attention mechanism, thereby improving depth estimation accuracy;
(2) reducing the gap between low-dimensional and high-dimensional features and accelerating network convergence through long and short skip connections;
(3) realizing a lightweight network design for fast depth estimation by means of a deep supervision technique.
Step (1): depth feature extraction based on the self-attention mechanism
The present invention employs a self-attention mechanism to improve the accuracy of sparse-sample-based depth estimation. The self-attention mechanism can focus on informative feature values and pass useful information through during the convolution stage. A network combined with self-attention can assign different weights to different input pixels instead of treating all pixels as equally valid information. The depth feature extraction method of the self-attention mechanism for sparse-measurement and RGB-image depth estimation provided by the invention is expressed as follows:
Attention_{y,x} = ΣΣ Weights_a · Input
Intermediate_{y,x} = ΣΣ Weights_i · Input
Output_{y,x} = φ(Intermediate_{y,x}) ⊙ σ(Attention_{y,x})
where Weights_a and Weights_i denote different convolution kernels, ⊙ denotes pixel-wise multiplication, σ denotes the sigmoid gating function, and φ denotes an activation function (e.g., ReLU, ELU or LeakyReLU).
The gating operation serves as the implementation of the self-attention mechanism: it makes the network attend to the feature content of each spatial position and channel, enabling efficient dynamic feature selection in the depth model. Because Attention_{y,x} learns to identify the regions that contain useful information, and the formulation above retains the important information of the feature map in the output, the self-attention convolution layer extracts more local and detailed information, so depth values can be predicted more accurately.
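To make the formulas above concrete, the following is a minimal PyTorch sketch of an attention-gated convolution layer of this kind. The module name SelfAttentionConv, the choice of LeakyReLU for φ and sigmoid for σ, and the RGB-plus-sparse-depth input layout are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionConv(nn.Module):
    """Attention-gated convolution: two parallel convolutions produce an
    attention map and an intermediate feature map, fused by pixel-wise
    multiplication (Output = phi(Intermediate) * sigma(Attention))."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2
        # Weights_a: produces the per-pixel, per-channel attention map
        self.attention_conv = nn.Conv2d(in_channels, out_channels,
                                        kernel_size, stride, padding)
        # Weights_i: produces the intermediate feature map
        self.feature_conv = nn.Conv2d(in_channels, out_channels,
                                      kernel_size, stride, padding)
        self.activation = nn.LeakyReLU(0.2, inplace=True)  # phi
        self.gate = nn.Sigmoid()                            # sigma (gating)

    def forward(self, x):
        attention = self.gate(self.attention_conv(x))       # values in (0, 1)
        intermediate = self.activation(self.feature_conv(x))
        return intermediate * attention                      # pixel-wise product


# Example: fuse an RGB image with one sparse depth channel (4 input channels)
if __name__ == "__main__":
    layer = SelfAttentionConv(in_channels=4, out_channels=32)
    rgbd = torch.randn(1, 4, 224, 224)   # RGB + sparse depth
    out = layer(rgbd)
    print(out.shape)                      # torch.Size([1, 32, 224, 224])
```

The per-pixel sigmoid gate is what lets the layer down-weight empty regions of the sparse depth map while passing through the pixels that actually carry measurements.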
Step (2): long and short skip connections based on Unet++
To reduce the semantic gap between feature maps, the invention adds long and short skip connections to Unet++, connecting a series of encoder and decoder sub-networks. These nested, dense skip connections incorporate the image detail of the high-resolution feature maps in the encoder into the features of the decoder, helping the decoding layers reconstruct a more detailed dense output.
The invention adopts the long skip connections used in the Unet++ network and adds short skip connections to extend it. Concretely, a residual network block (ResBlock) replaces the plain convolution block of the original Unet++. Experimental results show that ResBlock not only speeds up convergence during training but also improves the accuracy of depth estimation during testing. In addition, combined with the self-attention mechanism of step (1), the network of the invention is designed as self-attention Unet++, as shown in FIG. 4.
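As an illustration of this design choice, the sketch below shows a residual block (the short skip connection) and a nested Unet++ decoder node that concatenates all same-resolution features (the long, dense skip connections). The helper names ResBlock and nested_node and the channel handling are assumptions for exposition, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block used in place of the plain Unet++ convolution block;
    the short skip (identity) connection adds the input to the conv output."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # 1x1 projection so the short skip matches the output channel count
        self.shortcut = (nn.Conv2d(in_channels, out_channels, 1)
                         if in_channels != out_channels else None)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.shortcut is None else self.shortcut(x)
        return self.relu(self.body(x) + identity)   # short skip connection


def nested_node(block, skips, upsampled):
    """A Unet++ decoder node X(i,j): concatenate all same-resolution skip
    features (long, dense skips) with the upsampled feature from the level
    below, then apply the residual block."""
    return block(torch.cat(skips + [upsampled], dim=1))

# Example: node X(0,2) receives X(0,0), X(0,1) and the upsampled X(1,1);
# the block's in_channels must equal the sum of the concatenated channels.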
Step (3): lightweight network pruning based on the deep supervision mechanism
The invention directly supervises the hidden layers with a deep supervision method, so that the self-attention modules at different levels are able to influence the prediction of the full-scale depth map. Another major purpose of combining deep supervision with self-attention Unet++ is that it offers a new approach to lightweight network design. With this method, the fully trained self-attention Unet++ can be split into four modes at test time, as shown in FIG. 5; combined with the Unet++ architecture and the deep supervision method, the self-attention Unet++ network generates multi-level full-resolution depth maps {Output_{0,j}, j ∈ {1, 2, 3, 4}}. In practical use, one of these four modes (separate sub-networks) can be selected according to the specific requirements to achieve maximum task performance.
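A minimal sketch of how the deep supervision loss and the test-time mode selection could look in PyTorch is given below. The function names, the L1 criterion and the assumption that the model returns the four full-resolution depth maps {Output_{0,j}} as a list are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

def deep_supervision_loss(outputs, target, criterion=nn.L1Loss()):
    """Average the loss over all four full-resolution heads Output_{0,j},
    so every hidden level is supervised directly during training."""
    return sum(criterion(o, target) for o in outputs) / len(outputs)


def predict(model, rgb, sparse_depth, mode=4):
    """At test time the trained network can be pruned: mode j only needs the
    sub-network that produces Output_{0,j} (M1 is the lightest and fastest,
    M4 the most accurate). `model` is assumed to return a list of the four
    full-resolution depth maps."""
    with torch.no_grad():
        outputs = model(rgb, sparse_depth)   # [Output_{0,1}, ..., Output_{0,4}]
    return outputs[mode - 1]
```

Because every head is trained against the same target, dropping the deeper heads at inference time degrades accuracy gracefully instead of breaking the prediction, which is what makes the pruned modes usable on resource-limited devices.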
The invention has the beneficial effects that:
the method improves the performance of a sparse sample depth estimation task by utilizing a self-attention mechanism, a long and short dense jump connection and a depth supervision mode. The self-attention mechanism and the long and short jump connection in Unet + + enable the network to focus on precise feature values in the convolution stage and deliver useful information to improve the accuracy of depth prediction. By combining a deep supervision technology, the self-attention Unet + + can be split into a series of sub-networks, and the method can be flexibly applied in practical application to pursue maximization of task performance.
Drawings
FIG. 1 is a schematic flow diagram of a real-time dense depth estimation method based on sparse measurement and monocular RGB images;
FIG. 2 shows prediction results of the method on the NYU-Depth-v2 dataset: (a) RGB image; (b) 200 sparse depth measurements; (c) depth ground truth; (d) prediction of AttUnet++ M4;
FIG. 3 shows prediction results of the method on the KITTI dataset: (a) RGB image; (b) 200 sparse depth measurements; (c) depth ground truth; (d) prediction of AttUnet++ M4;
FIG. 4 is a schematic diagram of the self-attention Unet++ network architecture;
FIG. 5 is a diagram of the four depth estimation branches selectable in architectures of different complexity.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
The method provided by the invention uses the indoor dataset NYU-Depth-v2 and the outdoor dataset KITTI as experimental datasets to verify the real-time dense depth estimation method based on sparse measurement and monocular RGB images. The experimental platform comprised PyTorch 0.4.1, Python 3.6, Ubuntu 16.04 and an NVIDIA Titan V GPU. The NYU-Depth-v2 dataset consists of high-quality 480×640 RGB and depth data collected by a Kinect; following the official split, 249 scenes containing 26331 pictures are used for training and 215 scenes containing 654 pictures for testing. The KITTI odometry dataset consists of 22 sequences including camera and lidar measurements; 46000 training-sequence images of the binocular RGB camera are used in the training stage and 3200 test-sequence images in the testing stage. The original NYU-Depth-v2 images were downsampled to 224×224, while the KITTI images were cropped to 224×336 due to GPU memory limitations. FIGS. 2 and 3 show the prediction results of the method on the NYU-Depth-v2 and KITTI datasets; Tables 1 and 2 list the results of the four modes AttUnet++ M1, AttUnet++ M2, AttUnet++ M3 and AttUnet++ M4 tested on NYU-Depth-v2 and KITTI. The experimental results show that, at a sparse sampling rate of 1/10000, the depth estimation error on the indoor NYU-Depth-v2 dataset is less than 4 m and that on the outdoor KITTI odometry dataset is less than 7 m.
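For reproducibility of the input setup, the following sketch shows one plausible way to generate the sparse depth input by randomly retaining a small number of valid pixels from the ground-truth depth map (e.g., the 200 measurements visualized in FIGS. 2 and 3). The function name and sampling strategy are assumptions; the patent does not specify the exact sampling procedure.

```python
import torch

def sample_sparse_depth(dense_depth, num_samples=200):
    """Build the sparse depth input by randomly keeping `num_samples` valid
    pixels of the ground-truth depth map (H x W) and zeroing out the rest."""
    sparse = torch.zeros_like(dense_depth)
    valid = torch.nonzero(dense_depth > 0)            # (N, 2) valid pixel coords
    if valid.numel() == 0:
        return sparse
    idx = torch.randperm(valid.shape[0])[:num_samples]
    rows, cols = valid[idx, 0], valid[idx, 1]
    sparse[rows, cols] = dense_depth[rows, cols]
    return sparse


# Example: a 224x224 ground-truth depth map with 200 retained measurements
if __name__ == "__main__":
    gt = torch.rand(224, 224) * 10.0                  # synthetic depth in metres
    sparse = sample_sparse_depth(gt, num_samples=200)
    print((sparse > 0).sum().item())                  # 200
```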
TABLE 1: Results of the four modes tested on the NYU-Depth-v2 dataset (table data provided as an image in the original publication)
TABLE 2: Results of the four modes tested on the KITTI dataset (table data provided as an image in the original publication)
The technical means disclosed by the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features.

Claims (4)

1. A real-time dense depth estimation method based on sparse measurement and monocular RGB images is characterized in that: the method comprises the following steps:
(1) extracting information from the sparse depth measurements with a self-attention mechanism, thereby improving depth estimation accuracy;
(2) reducing the gap between low-dimensional and high-dimensional features and accelerating network convergence through long and short skip connections;
(3) realizing a lightweight network design for fast depth estimation by means of a deep supervision technique.
2. The real-time dense depth estimation method based on sparse measurement and monocular RGB images as claimed in claim 1, wherein: the depth feature extraction method based on the self-attention mechanism in the step (1) is represented as follows:
Attention_{y,x} = ΣΣ Weights_a · Input
Intermediate_{y,x} = ΣΣ Weights_i · Input
Output_{y,x} = φ(Intermediate_{y,x}) ⊙ σ(Attention_{y,x})
where Weights_a and Weights_i denote different convolution kernels, ⊙ denotes pixel-wise multiplication, σ denotes the sigmoid gating function, and φ denotes an activation function.
3. The real-time dense depth estimation method based on sparse measurement and monocular RGB images as claimed in claim 1, wherein the specific method of the long and short skip connections based on Unet++ in step (2) is as follows:
the long skip connections used in the Unet++ network are adopted and short skip connections are added to extend the network; concretely, a residual network block replaces the convolution block of the original Unet++; in addition, combined with the self-attention mechanism of step (1), the network is designed as self-attention Unet++.
4. The real-time dense depth estimation method based on sparse measurement and monocular RGB images as claimed in claim 1, wherein: the lightweight network pruning method based on the deep supervision mechanism in the step (3) comprises the following steps:
splitting the trained AttUnet++ into four modes during testing; combined with the Unet++ network architecture and the deep supervision method, the AttUnet++ network generates multi-level full-resolution depth maps {Output_{0,j}, j ∈ {1, 2, 3, 4}}; in practical use, these separate sub-networks are selected from the above four modes according to the specific requirements to obtain maximum task performance.
CN202010910048.1A 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image Active CN112132880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010910048.1A CN112132880B (en) 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010910048.1A CN112132880B (en) 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image

Publications (2)

Publication Number Publication Date
CN112132880A (en) 2020-12-25
CN112132880B CN112132880B (en) 2024-05-03

Family

ID=73848921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010910048.1A Active CN112132880B (en) 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image

Country Status (1)

Country Link
CN (1) CN112132880B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819876A (en) * 2021-02-13 2021-05-18 Northwestern Polytechnical University Monocular vision depth estimation method based on deep learning
CN112907573A (en) * 2021-03-25 2021-06-04 Southeast University Depth completion method based on 3D convolution

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685842A (en) * 2018-12-14 2019-04-26 University of Electronic Science and Technology of China A sparse-depth densification method based on a multi-scale network
CN110956655A (en) * 2019-12-09 2020-04-03 Tsinghua University Dense depth estimation method based on a monocular image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685842A (en) * 2018-12-14 2019-04-26 University of Electronic Science and Technology of China A sparse-depth densification method based on a multi-scale network
CN110956655A (en) * 2019-12-09 2020-04-03 Tsinghua University Dense depth estimation method based on a monocular image

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819876A (en) * 2021-02-13 2021-05-18 Northwestern Polytechnical University Monocular vision depth estimation method based on deep learning
CN112819876B (en) * 2021-02-13 2024-02-27 Northwestern Polytechnical University Monocular vision depth estimation method based on deep learning
CN112907573A (en) * 2021-03-25 2021-06-04 Southeast University Depth completion method based on 3D convolution
CN112907573B (en) * 2021-03-25 2022-04-29 Southeast University Depth completion method based on 3D convolution

Also Published As

Publication number Publication date
CN112132880B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN104200237B (en) One kind being based on the High-Speed Automatic multi-object tracking method of coring correlation filtering
CN111862213A (en) Positioning method and device, electronic equipment and computer readable storage medium
CN109558832A (en) A kind of human body attitude detection method, device, equipment and storage medium
CN108830185B (en) Behavior identification and positioning method based on multi-task joint learning
CN111105439B (en) Synchronous positioning and mapping method using residual attention mechanism network
CN114820655B (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
WO2022141718A1 (en) Method and system for assisting point cloud-based object detection
CN112907573B (en) Depth completion method based on 3D convolution
CN112132880B (en) Real-time dense depth estimation method based on sparse measurement and monocular RGB image
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN116935332A (en) Fishing boat target detection and tracking method based on dynamic video
Li et al. Blinkflow: A dataset to push the limits of event-based optical flow estimation
CN114693744A (en) Optical flow unsupervised estimation method based on improved cycle generation countermeasure network
CN113901931A (en) Knowledge distillation model-based behavior recognition method for infrared and visible light videos
CN116805360B (en) Obvious target detection method based on double-flow gating progressive optimization network
CN111161323B (en) Complex scene target tracking method and system based on correlation filtering
CN116452654B (en) BEV perception-based relative pose estimation method, neural network and training method thereof
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN115358962A (en) End-to-end visual odometer method and device
CN115496788A (en) Deep completion method using airspace propagation post-processing module
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN114202587A (en) Visual feature extraction method based on shipborne monocular camera
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
Liu et al. L2-LiteSeg: A Real-Time Semantic Segmentation Method for End-to-End Autonomous Driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant