CN112132880A - Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image - Google Patents
Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image Download PDFInfo
- Publication number
- CN112132880A (Application CN202010910048.1A)
- Authority
- CN
- China
- Prior art keywords
- depth
- network
- real
- sparse
- depth estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/50 — Image analysis; depth or shape recovery
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The invention discloses a real-time dense depth estimation method based on sparse measurements and monocular RGB images, which adopts a self-attention mechanism and long and short dense skip connections to extract more useful information from the sparse depth measurements. A lightweight network design method for real-time depth estimation is also provided by combining a deep supervision technique. The experimental results verify the effectiveness of the self-attention mechanism, the long and short dense skip connections, and the deep supervision technique, and show that the proposed method balances network prediction accuracy against inference speed to obtain maximum efficiency. With the depth estimated in real time by the method, at a sparse sampling rate below 1/10000, the depth error is within 30 cm on the indoor dataset NYU-Depth-v2 and within 4 m on the outdoor dataset KITTI.
Description
Technical Field
The invention belongs to the technical field of robot vision positioning and navigation, and particularly relates to a real-time dense depth estimation method based on sparse measurements and monocular RGB images.
Background
Dense depth estimation plays an important role in fields such as unmanned aerial vehicles, intelligent navigation, and augmented reality. The current mainstream depth acquisition solutions pair a high-resolution camera with a low-resolution depth sensor; such sensors are generally expensive and do not produce dense depth, and are therefore impractical for most applications. Furthermore, the accuracy and reliability of RGB-only depth estimation are still far from practical, although more than a decade of research has been devoted to improving them with deep learning methods. Therefore, high-precision real-time dense depth estimation from a single image and sparse depth measurements, acquired by a monocular camera and a low-resolution depth sensor, is of great significance.
One major advantage of sparse-sample-based approaches over depth estimation from only an RGB or grayscale image is that the sparse depth measurements can be regarded as part of the output ground truth. However, most current sparse-sample-based depth estimation approaches follow a network design similar to that of single-frame RGB methods, which leaves the sparse information under-utilized. To address this problem, the invention uses a self-attention mechanism and long and short dense skip connections to further improve the accuracy of sparse-sample-based depth estimation. In addition, past research on monocular depth estimation has focused almost entirely on improving accuracy, so computation-intensive algorithms cannot easily be adopted in robotic systems; since most systems have limited computing and storage resources, especially tiny devices, a key challenge is to balance an algorithm's running-time cost against its accuracy.
Disclosure of Invention
In order to solve the above problems, the invention discloses a real-time dense depth estimation method based on sparse measurements and monocular RGB images, which uses a self-attention mechanism, dense skip connections, and deep supervision to improve the performance of the sparse-sample depth estimation task, balancing network prediction accuracy against inference speed to maximize efficiency.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a real-time dense depth estimation method based on sparse measurements and monocular RGB images comprises the following steps:
(1) extracting information from the sparse depth measurements with a self-attention mechanism, thereby improving depth estimation accuracy;
(2) reducing the gap between low-dimensional and high-dimensional features with long and short skip connections, which speeds up network convergence;
(3) realizing a lightweight network design for fast depth estimation with a deep supervision technique.
Step (1): depth feature extraction based on the self-attention mechanism
The invention employs a self-attention mechanism to improve the accuracy of sparse-sample-based depth estimation. The self-attention mechanism can focus on the informative feature values and pass useful information through the convolution stages. A network combined with self-attention can give different weights to different input pixels instead of treating all pixels as equally valid information. The depth feature extraction of the self-attention mechanism for sparse-measurement and RGB-image depth estimation proposed by the invention is expressed as:
Attention_{y,x} = Σ Σ Weights_a · Input
Intermediate_{y,x} = Σ Σ Weights_i · Input
where Weights_a and Weights_i denote different convolution kernels, ⊙ denotes pixel-wise multiplication, and σ denotes an activation function (e.g., ReLU, ELU, or LeakyReLU).
A gating operation serves as the implementation of the self-attention mechanism: it makes the network attend to the feature content of each spatial position and channel, enabling efficient dynamic feature selection in the depth model. Because Attention_{y,x} learns to identify the regions containing useful information, the important information of the feature map is retained in the output according to the model above, so the self-attention convolution layer concentrates on extracting more local and detailed information and predicts depth values more accurately.
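The gated self-attention convolution described above can be sketched in PyTorch as follows. This is a minimal illustration, not the patent's exact layer: the class and parameter names are ours, the sigmoid gate and LeakyReLU choice are assumptions consistent with the text (two convolution kernels Weights_a and Weights_i, a pixel-wise product ⊙, and an activation σ).

```python
import torch
import torch.nn as nn

class SelfAttentionConv(nn.Module):
    """Sketch of a gated self-attention convolution.

    Two parallel convolutions play the roles of Weights_a and Weights_i;
    the sigmoid-gated attention map re-weights the intermediate features
    pixel by pixel, so the layer can down-weight uninformative pixels.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.attn_conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)  # Weights_a
        self.feat_conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)  # Weights_i
        self.act = nn.LeakyReLU(0.2)  # sigma: ReLU/ELU/LeakyReLU per the text

    def forward(self, x):
        attention = torch.sigmoid(self.attn_conv(x))  # Attention_{y,x}, values in (0, 1)
        intermediate = self.act(self.feat_conv(x))    # Intermediate_{y,x}
        return attention * intermediate               # pixel-wise gating (⊙)

# Example: 4 input channels could correspond to RGB + a sparse depth channel.
layer = SelfAttentionConv(4, 32)
x = torch.randn(1, 4, 224, 224)
y = layer(x)
print(y.shape)  # torch.Size([1, 32, 224, 224])
```

With padding equal to half the kernel size, the layer preserves spatial resolution, so it can replace a plain convolution anywhere in the encoder or decoder.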
Step (2): long and short skip connections based on Unet++
To reduce the semantic gap between feature maps, the invention adds a long and short skip connection mechanism to Unet++, connecting a series of encoder and decoder sub-networks. These nested, dense skip connections carry image details from the high-resolution feature maps of the encoder into the features of the decoder, helping the decoding layers reconstruct a more detailed dense output.
The invention adopts the long skip connections used in the Unet++ network and adds short skip connections to extend it. Concretely, a residual network block (ResBlock) replaces each convolution block of the original Unet++. Experimental results show that ResBlock not only improves convergence speed during training but also improves depth estimation accuracy during testing. In addition, combined with the self-attention mechanism of step (1), the network of the invention is designed as a self-attention Unet++, as shown in FIG. 4.
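A plain ResBlock of the kind that could replace a Unet++ convolution block can be sketched as below. This is a generic residual block under our own assumptions (3×3 convolutions, BatchNorm, ReLU, a 1×1 projection when channel counts differ); the patent does not specify the exact block internals.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block used in place of a plain Unet++ conv block (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the identity path matches the output channels
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The addition is the "short" skip connection inside the block;
        # the "long" skips remain the dense Unet++ connections between blocks.
        return self.relu(self.body(x) + self.skip(x))
```

The identity shortcut lets gradients bypass the convolutions, which is the usual explanation for the faster convergence observed during training.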
Step (3): lightweight network pruning based on a deep supervision mechanism
The invention directly supervises the hidden layers with deep supervision, so that the self-attention modules at different levels are able to influence the prediction of the full-scale depth map. Another major purpose of combining deep supervision with the self-attention Unet++ is that it offers a new approach to lightweight network design. With this method, the fully trained self-attention Unet++ can be split into four modes at test time, as shown in FIG. 5; combining the Unet++ network architecture with the deep supervision method, the self-attention Unet++ network produces multi-level full-resolution depth maps {Output_{0,j}, j ∈ {1, 2, 3, 4}}. In practical use, one of the four modes is selected from these separate networks according to the specific requirements to achieve maximum task performance.
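Deep supervision of the four full-resolution heads can be sketched as a weighted sum of per-head losses. The function name, the per-head weights, and the choice of L1 loss are our assumptions for illustration; the patent only states that each head Output_{0,j} is supervised against the full-scale depth map.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(outputs, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Supervise every full-resolution head Output_{0,j}, j = 1..4 (sketch).

    outputs: list of 4 predicted depth maps, each (N, 1, H, W)
    target:  ground-truth depth map, (N, 1, H, W)
    Returns the weighted sum of per-head L1 losses.
    """
    return sum(w * F.l1_loss(out, target) for w, out in zip(weights, outputs))
```

Because every head is trained to predict the full depth map on its own, the network can be pruned at test time to the sub-network ending at Output_{0,j}, trading accuracy for speed — which is exactly what makes the four test-time modes possible.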
The invention has the beneficial effects that:
the method improves the performance of a sparse sample depth estimation task by utilizing a self-attention mechanism, a long and short dense jump connection and a depth supervision mode. The self-attention mechanism and the long and short jump connection in Unet + + enable the network to focus on precise feature values in the convolution stage and deliver useful information to improve the accuracy of depth prediction. By combining a deep supervision technology, the self-attention Unet + + can be split into a series of sub-networks, and the method can be flexibly applied in practical application to pursue maximization of task performance.
Drawings
FIG. 1 is a schematic flow diagram of the real-time dense depth estimation method based on sparse measurements and monocular RGB images;
FIG. 2 shows the prediction results of the method on the NYU-Depth-v2 dataset: (a) RGB image; (b) 200 sparse depth measurements; (c) depth ground truth; (d) prediction of AttUnet++ M4;
FIG. 3 shows the prediction results of the method on the KITTI dataset: (a) RGB image; (b) 200 sparse depth measurements; (c) depth ground truth; (d) prediction of AttUnet++ M4;
FIG. 4 is a schematic diagram of the self-attention Unet++ network architecture;
FIG. 5 is a diagram of the four depth estimation branches selectable at different architecture complexities.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
The method provided by the invention uses the indoor dataset NYU-Depth-v2 and the outdoor dataset KITTI as experimental datasets to verify the real-time dense depth estimation method based on sparse measurements and monocular RGB images. The experimental platform comprises PyTorch 0.4.1, Python 3.6, Ubuntu 16.04, and an NVIDIA Titan V GPU. The NYU-Depth-v2 dataset consists of high-quality 480×640 RGB and depth data collected by a Kinect. Following the official split of the data, 249 scenes containing 26331 pictures are used for training and 215 scenes containing 654 pictures for testing. The KITTI odometry dataset consists of 22 sequences including camera and lidar measurements; 46000 training-sequence images of the binocular RGB camera are used in the training stage, and 3200 test-sequence images in the testing stage. The original NYU-Depth-v2 images were downsampled to 224×224, while the KITTI odometry images were cropped to 224×336 due to GPU memory limitations. FIGS. 2 and 3 show the prediction results of the method on the NYU-Depth-v2 and KITTI datasets; Tables 1 and 2 report the results of the four modes AttUnet++ M1, AttUnet++ M2, AttUnet++ M3, and AttUnet++ M4 tested on NYU-Depth-v2 and KITTI. The experimental results show that at a sparse sampling rate of 1/10000, the depth estimation error on the indoor NYU-Depth-v2 dataset is less than 4 m, and that on the outdoor KITTI odometry dataset is less than 7 m.
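In experiments like these, the sparse depth input (e.g. the 200 measurements shown in FIGS. 2 and 3) is typically simulated by randomly sampling pixels of the ground-truth depth map. The helper below is an illustrative sketch under that assumption; the patent does not prescribe the sampling procedure, and the function name is ours.

```python
import torch

def sample_sparse_depth(dense_depth, num_samples=200):
    """Simulate a low-resolution depth sensor (sketch): keep `num_samples`
    randomly chosen valid pixels of a ground-truth depth map.

    dense_depth: (H, W) tensor of depths; 0 marks invalid pixels.
    Returns a same-shape map that is zero everywhere except the samples.
    """
    valid = torch.nonzero(dense_depth > 0)           # (K, 2) pixel coordinates
    idx = torch.randperm(valid.shape[0])[:num_samples]
    sparse = torch.zeros_like(dense_depth)
    ys, xs = valid[idx, 0], valid[idx, 1]
    sparse[ys, xs] = dense_depth[ys, xs]
    return sparse

dense = torch.rand(224, 224) * 10.0                  # fake depth map in metres
sparse = sample_sparse_depth(dense)                  # 200 / (224*224) ≈ 1/250 rate
print((sparse > 0).sum().item())  # 200
```

At the paper's quoted 1/10000 rate on a 480×640 NYU-Depth-v2 frame, `num_samples` would be on the order of 30 points.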
Table 1: results of the four modes tested on the NYU-Depth-v2 dataset
Table 2: results of the four modes tested on the KITTI dataset
The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features.
Claims (4)
1. A real-time dense depth estimation method based on sparse measurements and monocular RGB images, characterized in that the method comprises the following steps:
(1) extracting information from the sparse depth measurements with a self-attention mechanism, thereby improving depth estimation accuracy;
(2) reducing the gap between low-dimensional and high-dimensional features with long and short skip connections, which speeds up network convergence;
(3) realizing a lightweight network design for fast depth estimation with a deep supervision technique.
2. The real-time dense depth estimation method based on sparse measurements and monocular RGB images according to claim 1, characterized in that the depth feature extraction method based on the self-attention mechanism in step (1) is expressed as:
Attention_{y,x} = Σ Σ Weights_a · Input
Intermediate_{y,x} = Σ Σ Weights_i · Input
3. The real-time dense depth estimation method based on sparse measurements and monocular RGB images according to claim 1, characterized in that the specific method of the long and short skip connections based on Unet++ in step (2) is as follows:
the long skip connections used in the Unet++ network are adopted, and short skip connections are added to extend the network; concretely, a residual network block replaces each convolution block of the original Unet++; in addition, combined with the self-attention mechanism of step (1), the network is designed as a self-attention Unet++.
4. The real-time dense depth estimation method based on sparse measurements and monocular RGB images according to claim 1, characterized in that the lightweight network pruning method based on the deep supervision mechanism in step (3) comprises:
splitting the trained AttUnet++ into four modes at test time; combining the Unet++ network architecture with the deep supervision method, the AttUnet++ network produces multi-level full-resolution depth maps {Output_{0,j}, j ∈ {1, 2, 3, 4}}; in practical use, one of the four modes is selected from these separate networks according to specific requirements to obtain maximum task performance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010910048.1A CN112132880B (en) | 2020-09-02 | 2020-09-02 | Real-time dense depth estimation method based on sparse measurement and monocular RGB image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112132880A true CN112132880A (en) | 2020-12-25 |
CN112132880B CN112132880B (en) | 2024-05-03 |
Family
ID=73848921
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010910048.1A Active CN112132880B (en) | 2020-09-02 | 2020-09-02 | Real-time dense depth estimation method based on sparse measurement and monocular RGB image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112132880B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112819876A (en) * | 2021-02-13 | 2021-05-18 | 西北工业大学 | Monocular vision depth estimation method based on deep learning |
CN112907573A (en) * | 2021-03-25 | 2021-06-04 | 东南大学 | Depth completion method based on 3D convolution |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685842A (en) * | 2018-12-14 | 2019-04-26 | 电子科技大学 | A kind of thick densification method of sparse depth based on multiple dimensioned network |
CN110956655A (en) * | 2019-12-09 | 2020-04-03 | 清华大学 | Dense depth estimation method based on monocular image |
Also Published As
Publication number | Publication date |
---|---|
CN112132880B (en) | 2024-05-03 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |