CN112132880A - Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image

Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image

Info

Publication number
CN112132880A
CN112132880A (application CN202010910048.1A)
Authority
CN
China
Prior art keywords
depth
network
real
sparse
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010910048.1A
Other languages
Chinese (zh)
Other versions
CN112132880B (en)
Inventor
潘树国
赵涛
高旺
魏建胜
盛超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010910048.1A priority Critical patent/CN112132880B/en
Publication of CN112132880A publication Critical patent/CN112132880A/en
Application granted granted Critical
Publication of CN112132880B publication Critical patent/CN112132880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis
    • G06T7/50 — Depth or shape recovery
    • G06T2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T2207/10 — Image acquisition modality
    • G06T2207/10028 — Range image; Depth image; 3D point clouds
    • G06T2207/20 — Special algorithmic details
    • G06T2207/20081 — Training; Learning
    • G06T2207/20084 — Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a real-time dense depth estimation method based on sparse measurement and monocular RGB images. The method adopts a self-attention mechanism and long and short dense skip connections to extract more useful information from sparse depth measurements, and, combined with a deep supervision technique, provides a lightweight network design method for real-time depth estimation. Experiments verify the effectiveness of the self-attention mechanism, the long and short dense skip connections and the deep supervision technique, and show that the method balances network prediction accuracy against inference speed to obtain maximum efficiency. With the depth estimated in real time by the method, at a sparse sampling rate below 1/10000 the depth error on the indoor dataset NYU-Depth-v2 is within 30 cm and the error on the outdoor dataset KITTI is within 4 m.

Description

Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image
Technical Field
The invention belongs to the technical field of robot visual positioning and navigation, and particularly relates to a real-time dense depth estimation method based on sparse measurement and monocular RGB images.
Background
Dense depth estimation plays an important role in fields such as unmanned aerial vehicles, intelligent navigation and augmented reality. Current mainstream depth acquisition solutions consist of a high-resolution camera and a low-resolution depth sensor; such solutions are generally expensive and cannot provide dense depth, so they are impractical for most applications. Furthermore, although more than a decade of research has been devoted to improving RGB-based depth estimation with deep learning methods, its accuracy and reliability remain far from practical. Therefore, high-precision, real-time dense depth estimation from a single image and sparse depth measurements acquired by a monocular camera and a low-resolution depth sensor is of great significance.
Compared with depth estimation from only an RGB or grayscale image, a major advantage of sparse-sample-based approaches is that the sparse depth measurements can be regarded as part of the output ground truth. However, most current sparse-sample-based depth estimation methods follow a network design similar to that of single-frame RGB methods, which leaves the sparse information under-utilized. To address this problem, the invention uses a self-attention mechanism and long and short dense skip connections to further improve the accuracy of sparse-sample-based depth estimation. In addition, past research on monocular depth estimation has focused almost exclusively on improving accuracy, so computation-intensive algorithms cannot easily be adopted in robotic systems: most systems, especially tiny devices, have limited computing and storage resources, and a key challenge is therefore to balance the runtime cost of the algorithm against its accuracy.
Disclosure of Invention
In order to solve the above problems, the invention discloses a real-time dense depth estimation method based on sparse measurement and monocular RGB images, which uses a self-attention mechanism, dense skip connections and deep supervision to improve the performance of the sparse-sample depth estimation task, balancing network prediction accuracy against inference speed to maximize efficiency.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a real-time dense depth estimation method based on sparse measurement and monocular RGB images comprises the following steps:
(1) extracting information from the sparse depth measurements with a self-attention mechanism, thereby improving depth estimation accuracy;
(2) reducing the gap between low-dimensional and high-dimensional features and accelerating network convergence through long and short skip connections;
(3) realizing a lightweight network design for fast depth estimation by means of a deep supervision technique.
Step (1): depth feature extraction based on the self-attention mechanism
The present invention employs a self-attention mechanism to improve the accuracy of sparse-sample-based depth estimation. The self-attention mechanism can focus on informative feature values and pass useful information through during the convolution stage. A network combined with self-attention can assign different weights to different input pixels instead of treating all pixels as equally valid information. The depth feature extraction method of the self-attention mechanism for sparse-measurement and RGB-image depth estimation provided by the invention is expressed as follows:
Attention_{y,x} = ΣΣ Weights_a · Input
Intermediate_{y,x} = ΣΣ Weights_i · Input
Output_{y,x} = φ(Intermediate_{y,x}) ⊙ σ(Attention_{y,x})
where Weights_a and Weights_i denote different convolution kernels, ⊙ denotes pixel-wise multiplication, σ denotes the sigmoid gating function, and φ denotes an activation function (e.g., ReLU, ELU or LeakyReLU).
The gating operation serves as the implementation of the self-attention mechanism: it makes the network attend to the feature content of each spatial position and channel, enabling efficient dynamic feature selection in the depth model. Because Attention_{y,x} learns to identify the regions that contain useful information, and the formulation above retains the important information of the feature map in the output, the self-attention convolution layer extracts more local and detailed information, so depth values can be predicted more accurately.
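To make the formulas above concrete, the following is a minimal PyTorch sketch of an attention-gated convolution layer of this kind. The module name SelfAttentionConv, the choice of LeakyReLU for φ and sigmoid for σ, and the RGB-plus-sparse-depth input layout are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionConv(nn.Module):
    """Attention-gated convolution: two parallel convolutions produce an
    attention map and an intermediate feature map, fused by pixel-wise
    multiplication (Output = phi(Intermediate) * sigma(Attention))."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2
        # Weights_a: produces the per-pixel, per-channel attention map
        self.attention_conv = nn.Conv2d(in_channels, out_channels,
                                        kernel_size, stride, padding)
        # Weights_i: produces the intermediate feature map
        self.feature_conv = nn.Conv2d(in_channels, out_channels,
                                      kernel_size, stride, padding)
        self.activation = nn.LeakyReLU(0.2, inplace=True)  # phi
        self.gate = nn.Sigmoid()                            # sigma (gating)

    def forward(self, x):
        attention = self.gate(self.attention_conv(x))       # values in (0, 1)
        intermediate = self.activation(self.feature_conv(x))
        return intermediate * attention                      # pixel-wise product


# Example: fuse an RGB image with one sparse depth channel (4 input channels)
if __name__ == "__main__":
    layer = SelfAttentionConv(in_channels=4, out_channels=32)
    rgbd = torch.randn(1, 4, 224, 224)   # RGB + sparse depth
    out = layer(rgbd)
    print(out.shape)                      # torch.Size([1, 32, 224, 224])
```

The per-pixel sigmoid gate is what lets the layer down-weight empty regions of the sparse depth map while passing through the pixels that actually carry measurements.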
Step (2): long and short skip connections based on Unet++
To reduce the semantic gap between feature maps, the invention adds long and short skip connections to Unet++, connecting a series of encoder and decoder sub-networks. These nested, dense skip connections incorporate the image detail of the high-resolution feature maps in the encoder into the features of the decoder, helping the decoding layers reconstruct a more detailed dense output.
The invention adopts the long skip connections used in the Unet++ network and adds short skip connections to extend it. Concretely, a residual network block (ResBlock) replaces the plain convolution block of the original Unet++. Experimental results show that ResBlock not only speeds up convergence during training but also improves the accuracy of depth estimation during testing. In addition, combined with the self-attention mechanism of step (1), the network of the invention is designed as self-attention Unet++, as shown in FIG. 4.
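As an illustration of this design choice, the sketch below shows a residual block (the short skip connection) and a nested Unet++ decoder node that concatenates all same-resolution features (the long, dense skip connections). The helper names ResBlock and nested_node and the channel handling are assumptions for exposition, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block used in place of the plain Unet++ convolution block;
    the short skip (identity) connection adds the input to the conv output."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # 1x1 projection so the short skip matches the output channel count
        self.shortcut = (nn.Conv2d(in_channels, out_channels, 1)
                         if in_channels != out_channels else None)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.shortcut is None else self.shortcut(x)
        return self.relu(self.body(x) + identity)   # short skip connection


def nested_node(block, skips, upsampled):
    """A Unet++ decoder node X(i,j): concatenate all same-resolution skip
    features (long, dense skips) with the upsampled feature from the level
    below, then apply the residual block."""
    return block(torch.cat(skips + [upsampled], dim=1))

# Example: node X(0,2) receives X(0,0), X(0,1) and the upsampled X(1,1);
# the block's in_channels must equal the sum of the concatenated channels.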
Step (3): lightweight network pruning based on the deep supervision mechanism
The invention directly supervises the hidden layers with a deep supervision method, so that the self-attention modules at different levels are able to influence the prediction of the full-scale depth map. Another major purpose of combining deep supervision with self-attention Unet++ is that it offers a new approach to lightweight network design. With this method, the fully trained self-attention Unet++ can be split into four modes at test time, as shown in FIG. 5; combined with the Unet++ architecture and the deep supervision method, the self-attention Unet++ network generates multi-level full-resolution depth maps {Output_{0,j}, j ∈ {1, 2, 3, 4}}. In practical use, one of these four modes (separate sub-networks) can be selected according to the specific requirements to achieve maximum task performance.
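A minimal sketch of how the deep supervision loss and the test-time mode selection could look in PyTorch is given below. The function names, the L1 criterion and the assumption that the model returns the four full-resolution depth maps {Output_{0,j}} as a list are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

def deep_supervision_loss(outputs, target, criterion=nn.L1Loss()):
    """Average the loss over all four full-resolution heads Output_{0,j},
    so every hidden level is supervised directly during training."""
    return sum(criterion(o, target) for o in outputs) / len(outputs)


def predict(model, rgb, sparse_depth, mode=4):
    """At test time the trained network can be pruned: mode j only needs the
    sub-network that produces Output_{0,j} (M1 is the lightest and fastest,
    M4 the most accurate). `model` is assumed to return a list of the four
    full-resolution depth maps."""
    with torch.no_grad():
        outputs = model(rgb, sparse_depth)   # [Output_{0,1}, ..., Output_{0,4}]
    return outputs[mode - 1]
```

Because every head is trained against the same target, dropping the deeper heads at inference time degrades accuracy gracefully instead of breaking the prediction, which is what makes the pruned modes usable on resource-limited devices.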
The invention has the beneficial effects that:
the method improves the performance of a sparse sample depth estimation task by utilizing a self-attention mechanism, a long and short dense jump connection and a depth supervision mode. The self-attention mechanism and the long and short jump connection in Unet + + enable the network to focus on precise feature values in the convolution stage and deliver useful information to improve the accuracy of depth prediction. By combining a deep supervision technology, the self-attention Unet + + can be split into a series of sub-networks, and the method can be flexibly applied in practical application to pursue maximization of task performance.
Drawings
FIG. 1 is a schematic flow diagram of a real-time dense depth estimation method based on sparse measurement and monocular RGB images;
FIG. 2 shows prediction results of the method on the NYU-Depth-v2 dataset: (a) RGB image; (b) 200 sparse depth measurements; (c) depth ground truth; (d) prediction of AttUnet++ M4;
FIG. 3 shows prediction results of the method on the KITTI dataset: (a) RGB image; (b) 200 sparse depth measurements; (c) depth ground truth; (d) prediction of AttUnet++ M4;
FIG. 4 is a schematic diagram of the self-attention Unet++ network architecture;
FIG. 5 is a diagram of the four depth estimation branches selectable in architectures of different complexity.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
The method provided by the invention uses the indoor dataset NYU-Depth-v2 and the outdoor dataset KITTI as experimental datasets to verify the real-time dense depth estimation method based on sparse measurement and monocular RGB images. The experimental platform comprised PyTorch 0.4.1, Python 3.6, Ubuntu 16.04 and an NVIDIA Titan V GPU. The NYU-Depth-v2 dataset consists of high-quality 480×640 RGB and depth data collected by a Kinect; following the official split, 249 scenes containing 26331 pictures are used for training and 215 scenes containing 654 pictures for testing. The KITTI odometry dataset consists of 22 sequences including camera and lidar measurements; 46000 training-sequence images of the binocular RGB camera are used in the training stage and 3200 test-sequence images in the testing stage. The original NYU-Depth-v2 images were downsampled to 224×224, while the KITTI images were cropped to 224×336 due to GPU memory limitations. FIGS. 2 and 3 show the prediction results of the method on the NYU-Depth-v2 and KITTI datasets; Tables 1 and 2 list the results of the four modes AttUnet++ M1, AttUnet++ M2, AttUnet++ M3 and AttUnet++ M4 tested on NYU-Depth-v2 and KITTI. The experimental results show that, at a sparse sampling rate of 1/10000, the depth estimation error on the indoor NYU-Depth-v2 dataset is less than 4 m and that on the outdoor KITTI odometry dataset is less than 7 m.
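For reproducibility of the input setup, the following sketch shows one plausible way to generate the sparse depth input by randomly retaining a small number of valid pixels from the ground-truth depth map (e.g., the 200 measurements visualized in FIGS. 2 and 3). The function name and sampling strategy are assumptions; the patent does not specify the exact sampling procedure.

```python
import torch

def sample_sparse_depth(dense_depth, num_samples=200):
    """Build the sparse depth input by randomly keeping `num_samples` valid
    pixels of the ground-truth depth map (H x W) and zeroing out the rest."""
    sparse = torch.zeros_like(dense_depth)
    valid = torch.nonzero(dense_depth > 0)            # (N, 2) valid pixel coords
    if valid.numel() == 0:
        return sparse
    idx = torch.randperm(valid.shape[0])[:num_samples]
    rows, cols = valid[idx, 0], valid[idx, 1]
    sparse[rows, cols] = dense_depth[rows, cols]
    return sparse


# Example: a 224x224 ground-truth depth map with 200 retained measurements
if __name__ == "__main__":
    gt = torch.rand(224, 224) * 10.0                  # synthetic depth in metres
    sparse = sample_sparse_depth(gt, num_samples=200)
    print((sparse > 0).sum().item())                  # 200
```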
TABLE 1: Results of the four modes tested on the NYU-Depth-v2 dataset (table data provided as an image in the original publication)
TABLE 2: Results of the four modes tested on the KITTI dataset (table data provided as an image in the original publication)
The technical means disclosed by the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features.

Claims (4)

1. A real-time dense depth estimation method based on sparse measurement and monocular RGB images is characterized in that: the method comprises the following steps:
(1) extracting information from the sparse depth measurements with a self-attention mechanism, thereby improving depth estimation accuracy;
(2) reducing the gap between low-dimensional and high-dimensional features and accelerating network convergence through long and short skip connections;
(3) realizing a lightweight network design for fast depth estimation by means of a deep supervision technique.
2. The real-time dense depth estimation method based on sparse measurement and monocular RGB images as claimed in claim 1, wherein: the depth feature extraction method based on the self-attention mechanism in the step (1) is represented as follows:
Attention_{y,x} = ΣΣ Weights_a · Input
Intermediate_{y,x} = ΣΣ Weights_i · Input
Output_{y,x} = φ(Intermediate_{y,x}) ⊙ σ(Attention_{y,x})
where Weights_a and Weights_i denote different convolution kernels, ⊙ denotes pixel-wise multiplication, σ denotes the sigmoid gating function, and φ denotes an activation function.
3. The real-time dense depth estimation method based on sparse measurement and monocular RGB images as claimed in claim 1, wherein the specific method of the long and short skip connections based on Unet++ in step (2) is as follows:
the long skip connections used in the Unet++ network are adopted and short skip connections are added to extend the network; concretely, a residual network block replaces the convolution block of the original Unet++; in addition, combined with the self-attention mechanism of step (1), the network is designed as self-attention Unet++.
4. The real-time dense depth estimation method based on sparse measurement and monocular RGB images as claimed in claim 1, wherein: the lightweight network pruning method based on the deep supervision mechanism in the step (3) comprises the following steps:
splitting the trained AttUnet++ into four modes during testing; combined with the Unet++ network architecture and the deep supervision method, the AttUnet++ network generates multi-level full-resolution depth maps {Output_{0,j}, j ∈ {1, 2, 3, 4}}; in practical use, these separate sub-networks are selected from the above four modes according to the specific requirements to obtain maximum task performance.
CN202010910048.1A 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image Active CN112132880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010910048.1A CN112132880B (en) 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010910048.1A CN112132880B (en) 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image

Publications (2)

Publication Number Publication Date
CN112132880A (en) 2020-12-25
CN112132880B CN112132880B (en) 2024-05-03

Family

ID=73848921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010910048.1A Active CN112132880B (en) 2020-09-02 2020-09-02 Real-time dense depth estimation method based on sparse measurement and monocular RGB image

Country Status (1)

Country Link
CN (1) CN112132880B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819876A (en) * 2021-02-13 2021-05-18 Northwestern Polytechnical University Monocular vision depth estimation method based on deep learning
CN112907573A (en) * 2021-03-25 2021-06-04 Southeast University Depth completion method based on 3D convolution

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685842A (en) * 2018-12-14 2019-04-26 University of Electronic Science and Technology of China A sparse-depth densification method based on a multi-scale network
CN110956655A (en) * 2019-12-09 2020-04-03 Tsinghua University Dense depth estimation method based on a monocular image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685842A (en) * 2018-12-14 2019-04-26 University of Electronic Science and Technology of China A sparse-depth densification method based on a multi-scale network
CN110956655A (en) * 2019-12-09 2020-04-03 Tsinghua University Dense depth estimation method based on a monocular image

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819876A (en) * 2021-02-13 2021-05-18 Northwestern Polytechnical University Monocular vision depth estimation method based on deep learning
CN112819876B (en) * 2021-02-13 2024-02-27 Northwestern Polytechnical University Monocular vision depth estimation method based on deep learning
CN112907573A (en) * 2021-03-25 2021-06-04 Southeast University Depth completion method based on 3D convolution
CN112907573B (en) * 2021-03-25 2022-04-29 Southeast University Depth completion method based on 3D convolution

Also Published As

Publication number Publication date
CN112132880B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN104200237B (en) One kind being based on the High-Speed Automatic multi-object tracking method of coring correlation filtering
CN111862213A (en) Positioning method and device, electronic equipment and computer readable storage medium
CN109558832A (en) A kind of human body attitude detection method, device, equipment and storage medium
CN108830185B (en) Behavior identification and positioning method based on multi-task joint learning
CN111105439B (en) Synchronous positioning and mapping method using residual attention mechanism network
CN114820655B (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
WO2022141718A1 (en) Method and system for assisting point cloud-based object detection
CN112907573B (en) Depth completion method based on 3D convolution
CN112132880B (en) Real-time dense depth estimation method based on sparse measurement and monocular RGB image
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN116935332A (en) Fishing boat target detection and tracking method based on dynamic video
Li et al. Blinkflow: A dataset to push the limits of event-based optical flow estimation
CN114693744A (en) Optical flow unsupervised estimation method based on improved cycle generation countermeasure network
CN113901931A (en) Knowledge distillation model-based behavior recognition method for infrared and visible light videos
CN116805360B (en) Obvious target detection method based on double-flow gating progressive optimization network
CN111161323B (en) Complex scene target tracking method and system based on correlation filtering
CN116452654B (en) BEV perception-based relative pose estimation method, neural network and training method thereof
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN115358962A (en) End-to-end visual odometer method and device
CN115496788A (en) Deep completion method using airspace propagation post-processing module
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN114202587A (en) Visual feature extraction method based on shipborne monocular camera
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
Liu et al. L2-LiteSeg: A Real-Time Semantic Segmentation Method for End-to-End Autonomous Driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant