CN116468768B - Scene depth completion method based on conditional variation self-encoder and geometric guidance - Google Patents
Classifications
- G06T7/521: Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
- G01B11/24: Measuring arrangements characterised by the use of optical techniques for measuring contours or curvatures
- G01S17/86: Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
- G01S17/89: Lidar systems specially adapted for mapping or imaging
- G06F18/253: Fusion techniques of extracted features
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06T2207/10024: Color image
- G06T2207/10032: Satellite or aerial image; remote sensing
- G06T2207/10044: Radar image
- G06T2207/20084: Artificial neural networks [ANN]
- Y02T10/40: Engine management systems
Abstract
The invention discloses a scene depth completion method based on a conditional variational autoencoder and geometric guidance, comprising the following steps: acquiring a color image, a sparse depth map and a dense depth map in an autonomous driving scene; designing a conditional variational autoencoder with a prior network and a posterior network, inputting the color image and the sparse depth map into the prior network to extract features, and inputting the color image, the sparse depth map and the dense depth map into the posterior network to extract features; and converting the sparse depth map into a point cloud using the camera intrinsics, namely the focal lengths and optical center coordinates, extracting geometric spatial features with a point cloud up-sampling model, and mapping them back onto the sparse depth map. The invention alleviates the problem that the data acquired by a LiDAR is too sparse, so that a low-cost LiDAR with fewer beams can obtain more accurate and dense depth information, providing a cost-effective solution for industries that need accurate, dense depth data, such as autonomous driving and robotic environment perception.
Description
Technical Field
The invention relates to the technical field of depth map completion, and in particular to a scene depth completion method based on a conditional variational autoencoder and geometric guidance.
Background
Human perception, understanding and experience of the surrounding environment rely on visually acquired three-dimensional scene information. Computer vision mimics this behavior, using various sensors as visual organs to acquire scene information and thereby recognize and understand the scene; depth information plays a key role in fields such as robotics, autonomous driving and augmented reality. In autonomous driving, a vehicle must perceive the distances to other vehicles, pedestrians and obstacles while driving, and fully autonomous Level 5 driving requires ranging accurate to centimeters. Currently, LiDAR is the primary active distance sensor in autonomous driving. Compared with the two-dimensional RGB image acquired by a color camera, the depth map acquired by a LiDAR (the depth map and the point cloud can be converted into each other through the camera intrinsics) provides accurate depth, so the positions of 3D targets in the surrounding environment can be perceived precisely. However, a single LiDAR can only emit a limited number of laser beams in the vertical direction (16, 32 or 64 lines), so the acquired point cloud is extremely sparse (pixels with valid depth values account for only about 5% of the color image), which severely affects downstream tasks such as 3D object detection and three-dimensional environment perception.
Disclosure of Invention
The invention aims to provide a scene depth completion method based on a conditional variational autoencoder and geometric guidance, so as to solve the key problems of sparse and missing data produced by existing depth imaging devices such as LiDAR.
In order to achieve the above purpose, the present invention provides the following technical solution: a scene depth completion method based on a conditional variational autoencoder and geometric guidance, comprising the following steps:
acquiring a color image, a sparse depth map and a dense depth map in an autonomous driving scene;
designing a conditional variational autoencoder with a prior network and a posterior network, inputting the color image and the sparse depth map into the prior network to extract features, and inputting the color image, the sparse depth map and the dense depth map into the posterior network to extract features;
converting the sparse depth map into a point cloud using the camera intrinsics (focal lengths and optical center), extracting geometric spatial features with a point cloud up-sampling model, and mapping the geometric spatial features back onto the sparse depth map;
fusing the image features and the point cloud features with a dynamic graph message propagation module;
generating a preliminary depth completion map with a residual-network-based U-shaped encoder-decoder;
inputting the preliminarily predicted completion depth map into a confidence-uncertainty estimation module to obtain the final optimized depth completion.
Preferably, the acquiring of the color image and the sparse depth map in the autonomous driving scene includes:
capturing a color image and a sparse depth map in an autonomous driving scene using a color camera and a LiDAR;
converting the sparse depth map into a dense depth map with the Sparsity Invariant CNNs algorithm, to serve as the ground-truth label for auxiliary training.
Preferably, the designing of the conditional variational autoencoder with a prior network and a posterior network, inputting the color image and the sparse depth map into the prior network to extract features, and inputting the color image, the sparse depth map and the dense depth map into the posterior network to extract features, includes:
designing, based on a feature extraction module with a ResNet structure, a prior network and a posterior network of identical structure as the conditional variational autoencoder;
inputting the color image and the sparse depth map into the prior network to extract the last-layer feature map Prior, inputting the color image, the sparse depth map and the ground-truth label into the posterior network to extract the last-layer feature map Posterior, then computing the mean and variance of the Prior and Posterior feature maps respectively to obtain the probability distributions D1 and D2 of the respective features, and supervising the divergence between D1 and D2 with the Kullback-Leibler divergence loss function, so that the prior network learns the ground-truth label features captured by the posterior network.
Preferably, the converting of the sparse depth map into a point cloud using the camera intrinsics, extracting geometric spatial features with a point cloud up-sampling model, and mapping them back onto the sparse depth map, includes:
converting each sparse depth image pixel (u_i, v_i) from the pixel coordinate system into the camera coordinate system to obtain the point cloud coordinates (x_i, y_i, z_i), forming the sparse point cloud data S:
x_i = (u_i - c_x) * d_i / f_x,  y_i = (v_i - c_y) * d_i / f_y,  z_i = d_i
where (c_x, c_y) are the optical center coordinates of the camera, f_x and f_y are the focal lengths of the camera along the x-axis and y-axis, and d_i is the depth value at (u_i, v_i); for the ground-truth depth map, a dense label point cloud S1 is generated with the same formula;
randomly sampling the point cloud several times to obtain point cloud sets of different sizes; for each point cloud set, aggregating the 16 nearest points around each point with the KNN nearest-neighbor algorithm and feeding them into a geometry-aware neural network to extract the local geometric features of that point;
adding the sparse point cloud features extracted for each point to the original point cloud coordinates (x_i, y_i, z_i) to obtain the point cloud encoding feature Q, inputting Q into a four-fold up-sampling multi-layer perceptron network to obtain the predicted dense point cloud S2, and computing the loss between the ground-truth dense point cloud S1 and the predicted dense point cloud S2 with the Chamfer Distance (CD) loss function:
L_CD = Σ_{x∈S1} min_{y∈S2} ||x - y||² + Σ_{y∈S2} min_{x∈S1} ||x - y||²
where the first term is the sum of the minimum distances from every point x in S1 to S2, and the second term is the sum of the minimum distances from every point y in S2 to S1.
Preferably, the fusing of the image features and the point cloud features with the dynamic graph message propagation module includes:
designing two encoding networks of identical structure, each encoder consisting of 5 ResNet stages; inputting the color image and the sparse depth map into the RGB-branch encoder to extract five feature maps of different scales, L1, L2, L3, L4 and L5; inputting the point cloud feature Q and the sparse depth map into the point-cloud-branch encoder to extract five feature maps of different scales, P1, P2, P3, P4 and P5;
for L1, L2, L3, L4 and L5, obtaining pixels with different receptive fields through dilated convolution, and using deformable convolution to learn a coordinate offset for each pixel so as to dynamically aggregate the strongly correlated surrounding feature values of each pixel, yielding features T1, T2, T3, T4 and T5;
adding the dynamic-graph features T1, T2, T3, T4 and T5 to the point cloud encoding feature maps P1, P2, P3, P4 and P5 to obtain point cloud feature maps M1, M2, M3, M4 and M5 containing both semantic and geometric information.
Preferably, the generating of the preliminary depth completion map with the residual-network-based U-shaped encoder-decoder includes:
designing a multi-scale decoder structure corresponding to the encoder structure, forming an encoder-decoder network with a U-Net structure;
inputting the feature maps L1, L2, L3, L4 and L5 generated by the RGB branch into the U-Net to predict the first coarse depth completion map Depth1 and its confidence map C1;
inputting the feature maps M1, M2, M3, M4 and M5 generated by the point cloud branch into the U-Net to predict the second coarse depth completion map Depth2 and its confidence map C2.
Preferably, the inputting of the preliminarily predicted completion depth map into the confidence-uncertainty estimation module to obtain the final optimized depth completion includes:
adding the generated confidence maps C1 and C2 to obtain the feature map C, performing uncertainty prediction on C with a Softmax function, and predicting the uncertainty proportions F1 and F2 of each confidence map pixel by pixel;
multiplying the uncertainty maps F1 and F2 with the coarse depth maps Depth1 and Depth2 respectively to obtain the final optimized depth completion map.
Compared with the prior art, the invention has the following beneficial effects:
The conditional variational autoencoder learns the feature distribution of the true dense depth map and guides the color image and the sparse depth map to generate more valuable depth features; the point cloud features of the three-dimensional space capture spatial structural features across modalities, strengthening the geometric perception of the network and providing auxiliary information for predicting more accurate depth values; and the dynamic graph message propagation module fuses the features of the color image and the point cloud to achieve high-precision depth completion. The method thus alleviates the problem that the data acquired by a LiDAR is too sparse, enables a low-cost LiDAR with fewer beams to obtain more accurate and dense depth information, and provides a cost-effective solution for industries that need accurate, dense depth data, such as autonomous driving and robotic environment perception.
Drawings
FIG. 1 is a flow chart of the scene depth completion method based on a conditional variational autoencoder and geometric guidance provided by an embodiment of the invention;
FIG. 2 is a depth completion result of the scene depth completion method based on a conditional variational autoencoder and geometric guidance provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The method in this embodiment is executed by a terminal, which may be a mobile phone, a tablet computer, a PDA, a notebook or a desktop computer, or of course another device with similar functionality; this embodiment is not limited thereto.
Referring to fig. 1 and 2, the present invention provides a scene depth completion method based on a conditional variational autoencoder and geometric guidance, applied to autonomous driving scene depth completion, which includes:
Step S1: acquire a color image, a sparse depth map and a dense depth map in an autonomous driving scene.
Specifically, step S1 further includes the following steps:
S101, capturing a color image and a sparse depth map in an autonomous driving scene using a color camera and a LiDAR;
S102, converting the sparse depth map into a dense depth map with the Sparsity Invariant CNNs algorithm, to serve as the ground-truth label for auxiliary training.
An autonomous vehicle is mainly equipped with a color camera and a LiDAR, which acquire RGB images and depth maps respectively; the method additionally requires a completed depth map as the training label. The specific steps are as follows:
capture color images and depth maps in the autonomous driving scene using a color camera and a Velodyne HDL-64E LiDAR; convert the sparse depth map into a dense depth map with the Sparsity Invariant CNNs algorithm to serve as the ground-truth label.
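The densification in S102 relies on the core idea of Sparsity Invariant CNNs: normalize each convolution window by the number of valid depth pixels it contains, so missing measurements do not dilute the output. A minimal single-step sketch in NumPy, with a fixed averaging kernel standing in for the learned weights of the actual algorithm:

```python
import numpy as np

def sparsity_invariant_conv(depth, mask, k=3):
    """One normalized-convolution step: average only the valid depth
    values inside each k x k window, then propagate the validity mask."""
    pad = k // 2
    d = np.pad(depth * mask, pad)           # zero out invalid pixels first
    m = np.pad(mask.astype(float), pad)
    out = np.zeros_like(depth, dtype=float)
    new_mask = np.zeros_like(depth, dtype=float)
    h, w = depth.shape
    for i in range(h):
        for j in range(w):
            win_d = d[i:i + k, j:j + k].sum()
            win_m = m[i:i + k, j:j + k].sum()
            if win_m > 0:                    # at least one valid neighbour
                out[i, j] = win_d / win_m    # mask-normalized average
                new_mask[i, j] = 1.0
    return out, new_mask

# A 3x3 map with a single valid measurement propagates it to all neighbours.
depth = np.zeros((3, 3)); depth[1, 1] = 2.0
mask = np.zeros((3, 3)); mask[1, 1] = 1.0
dense, valid = sparsity_invariant_conv(depth, mask)
```

Stacking several such steps (with learned kernels in the real method) progressively fills the depth map while tracking which outputs are supported by actual measurements.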
Step S2: design a conditional variational autoencoder with a prior network and a posterior network, input the color image and the sparse depth map into the prior network to extract features, and input the color image, the sparse depth map and the dense depth map into the posterior network to extract features.
Specifically, step S2 further includes the following steps:
S201, designing, based on a feature extraction module with a ResNet structure, a prior network and a posterior network of identical structure as the conditional variational autoencoder;
S202, inputting the color image and the sparse depth map into the prior network to extract the last-layer feature map Prior, inputting the color image, the sparse depth map and the ground-truth label into the posterior network to extract the last-layer feature map Posterior, then computing the mean and variance of the Prior and Posterior feature maps respectively to obtain the probability distributions D1 and D2 of the respective features, and supervising the divergence between D1 and D2 with the Kullback-Leibler divergence loss function, so that the prior network learns the ground-truth label features captured by the posterior network.
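When D1 and D2 are modeled as diagonal Gaussians from the computed means and variances, the Kullback-Leibler term in S202 has a closed form. A sketch under that assumption (the mean/log-variance parameterization and the direction of the divergence are not stated in the source):

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians, summed over dimensions:
    0.5 * (log(sp^2/sq^2) + (sq^2 + (mq - mp)^2) / sp^2 - 1) per dim."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

# Identical distributions give zero divergence; shifting the mean by 1
# with unit variance gives 0.5.
same = kl_diag_gaussians([0.0], [0.0], [0.0], [0.0])
shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

Minimizing this quantity pulls the prior network's distribution toward the posterior's, which is how the prior branch absorbs label information it never sees directly at test time.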
Step S3: convert the sparse depth map into a point cloud using the camera intrinsics, extract geometric spatial features with a point cloud up-sampling model, and map them back onto the sparse depth map.
Specifically, step S3 further includes the following steps:
S301, converting each sparse depth image pixel (u_i, v_i) from the pixel coordinate system into the camera coordinate system to obtain the point cloud coordinates (x_i, y_i, z_i), forming the sparse point cloud data S:
x_i = (u_i - c_x) * d_i / f_x,  y_i = (v_i - c_y) * d_i / f_y,  z_i = d_i
where (c_x, c_y) are the optical center coordinates of the camera, f_x and f_y are the focal lengths of the camera along the x-axis and y-axis, and d_i is the depth value at (u_i, v_i); for the ground-truth depth map, a dense label point cloud S1 is generated with the same formula.
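The back-projection of S301 can be sketched directly from the intrinsics; the toy depth map and intrinsic values below are illustrative only:

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project valid pixels (u, v, d) to camera-space points:
    x = (u - cx) * d / fx, y = (v - cy) * d / fy, z = d."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0:                        # skip pixels with no LiDAR return
                points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points

# One valid pixel at (u=4, v=3) with depth 2.0; toy intrinsics fx=fy=2, cx=cy=1.
pts = depth_to_points([[0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 2.0]], fx=2.0, fy=2.0, cx=1.0, cy=1.0)
```

Running the same routine on the ground-truth depth map yields the dense label point cloud S1 mentioned above.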
S302, randomly sampling the point cloud several times to obtain point cloud sets of different sizes; for each point cloud set, aggregating the 16 nearest points around each point with the KNN nearest-neighbor algorithm and feeding them into a geometry-aware neural network to extract the local geometric features of that point;
S303, adding the sparse point cloud features extracted for each point to the original point cloud coordinates (x_i, y_i, z_i) to obtain the point cloud encoding feature Q, inputting Q into a four-fold up-sampling multi-layer perceptron network to obtain the predicted dense point cloud S2, and computing the loss between the ground-truth dense point cloud S1 and the predicted dense point cloud S2 with the Chamfer Distance (CD) loss function:
L_CD = Σ_{x∈S1} min_{y∈S2} ||x - y||² + Σ_{y∈S2} min_{x∈S1} ||x - y||²
where the first term is the sum of the minimum distances from every point x in S1 to S2, and the second term is the sum of the minimum distances from every point y in S2 to S1.
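The Chamfer Distance supervision of S303 can be sketched as a symmetric sum of nearest-neighbour squared distances. A brute-force O(|S1|·|S2|) version for illustration (practical implementations use KD-trees or GPU kernels, and some conventions average per set instead of summing):

```python
def chamfer_distance(s1, s2):
    """Symmetric Chamfer Distance between two point sets: for each point,
    the squared distance to its nearest neighbour in the other set, summed
    in both directions."""
    def sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    d12 = sum(min(sq(x, y) for y in s2) for x in s1)   # S1 -> S2 term
    d21 = sum(min(sq(y, x) for x in s1) for y in s2)   # S2 -> S1 term
    return d12 + d21

s1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # toy "ground-truth" set
s2 = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]   # toy "predicted" set
cd = chamfer_distance(s1, s2)
```

The two-sided form penalizes both missing structure (points of S1 far from S2) and spurious structure (points of S2 far from S1).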
Step S4: fuse the image features and the point cloud features with the dynamic graph message propagation module.
Specifically, step S4 further includes the following steps:
S401, designing two encoding networks of identical structure, each encoder consisting of 5 ResNet stages; inputting the color image and the sparse depth map into the RGB-branch encoder to extract five feature maps of different scales, L1, L2, L3, L4 and L5; inputting the point cloud feature Q and the sparse depth map into the point-cloud-branch encoder to extract five feature maps of different scales, P1, P2, P3, P4 and P5;
S402, for L1, L2, L3, L4 and L5, obtaining pixels with different receptive fields through dilated convolution, and using deformable convolution to learn a coordinate offset for each pixel so as to dynamically aggregate the strongly correlated surrounding feature values of each pixel, yielding features T1, T2, T3, T4 and T5;
S403, adding the dynamic-graph features T1, T2, T3, T4 and T5 to the point cloud encoding feature maps P1, P2, P3, P4 and P5 to obtain point cloud feature maps M1, M2, M3, M4 and M5 containing both semantic and geometric information.
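The dynamic aggregation of S402 can be illustrated with a toy stand-in in which the sampling offsets and their weights are fixed constants; in the actual module they are produced by the learned offset branch of the deformable convolution, so everything below is an illustrative assumption:

```python
def dynamic_aggregate(feat, offsets, weights):
    """Aggregate each pixel with features sampled at offset positions.
    Here `offsets` and `weights` are given constants standing in for the
    per-pixel outputs of a deformable-convolution offset branch."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for (di, dj), wt in zip(offsets, weights):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:   # ignore out-of-image samples
                    acc += wt * feat[ni][nj]
            out[i][j] = acc
    return out

feat = [[1.0, 2.0], [3.0, 4.0]]
# Centre tap plus a dilated tap one pixel to the right, equally weighted.
agg = dynamic_aggregate(feat, offsets=[(0, 0), (0, 1)], weights=[0.5, 0.5])
```

Changing the offset list changes each pixel's effective receptive field, which is the property the dilated and deformable convolutions of S402 exploit.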
Step S5: generate a preliminary depth completion map with the residual-network-based U-shaped encoder-decoder.
Specifically, step S5 further includes the following steps:
S501, designing a multi-scale decoder structure corresponding to the encoder structure of step S4, forming an encoder-decoder network with a U-Net structure;
S502, inputting the feature maps L1, L2, L3, L4 and L5 generated by the RGB branch into the U-Net to predict the first coarse depth completion map Depth1 and its confidence map C1;
S503, inputting the feature maps M1, M2, M3, M4 and M5 generated by the point cloud branch into the U-Net to predict the second coarse depth completion map Depth2 and its confidence map C2.
Step S6: input the preliminarily predicted completion depth map into the confidence-uncertainty estimation module to obtain the final optimized depth completion.
Specifically, step S6 further includes the following steps:
S601, adding the confidence maps C1 and C2 generated in step S5 to obtain the feature map C, performing uncertainty prediction on C with a Softmax function, and predicting the uncertainty proportions F1 and F2 of each confidence map pixel by pixel;
S602, multiplying the uncertainty maps F1 and F2 with the coarse depth maps Depth1 and Depth2 respectively to obtain the final optimized depth completion map.
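S601 and S602 amount to a per-pixel softmax over the two confidence maps followed by a confidence-weighted blend of the two coarse predictions. A sketch under the assumption that the two products of S602 are summed into one map (the source does not state how they are merged):

```python
import math

def fuse_depths(d1, d2, c1, c2):
    """Per-pixel softmax over the two confidence values, then a
    confidence-weighted blend of the two coarse depth predictions."""
    out = []
    for a, b, ca, cb in zip(d1, d2, c1, c2):
        ea, eb = math.exp(ca), math.exp(cb)
        f1, f2 = ea / (ea + eb), eb / (ea + eb)   # proportions F1, F2
        out.append(f1 * a + f2 * b)
    return out

# Equal confidence yields a simple average of the two branch predictions.
fused = fuse_depths([2.0], [4.0], [0.0], [0.0])
```

Because F1 + F2 = 1 at every pixel, the fused depth always lies between the two branch predictions, with the more confident branch dominating.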
In this embodiment, the conditional variational autoencoder first learns the feature distribution of the true dense depth map and guides the color image and the sparse depth map to generate more valuable depth features; next, the point cloud features of the three-dimensional space capture spatial structural features across modalities, strengthening the geometric perception of the network and providing auxiliary information for predicting more accurate depth values; finally, the dynamic graph message propagation module fuses the features of the color image and the point cloud to achieve high-precision depth completion.
In addition, it should be noted that the combination of the technical features described in the present invention is not limited to the combination described in the claims or the combination described in the specific embodiments, and all the technical features described in the present invention may be freely combined or combined in any manner unless contradiction occurs between them.
It should be noted that the above-mentioned embodiments are merely examples of the present invention, and it is obvious that the present invention is not limited to the above-mentioned embodiments, and many similar variations are possible. All modifications attainable or obvious from the present disclosure set forth herein should be deemed to be within the scope of the present disclosure.
The foregoing is merely illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
1. A scene depth completion method based on a conditional variational autoencoder and geometric guidance, characterized by comprising the following steps:
acquiring a color image, a sparse depth map and a dense depth map in an autonomous driving scene;
designing a conditional variational autoencoder with a prior network and a posterior network, inputting the color image and the sparse depth map into the prior network to extract features, and inputting the color image, the sparse depth map and the dense depth map into the posterior network to extract features;
converting the sparse depth map into a point cloud using the camera intrinsics (focal lengths and optical center), extracting geometric spatial features with a point cloud up-sampling model, and mapping the geometric spatial features back onto the sparse depth map;
fusing the image features and the point cloud features with a dynamic graph message propagation module, the specific steps of the fusion being: designing two encoding networks of identical structure, each encoder consisting of 5 ResNet stages; inputting the color image and the sparse depth map into the RGB-branch encoder to extract five feature maps of different scales, L1, L2, L3, L4 and L5; inputting the point cloud feature Q and the sparse depth map into the point-cloud-branch encoder to extract five feature maps of different scales, P1, P2, P3, P4 and P5; for L1, L2, L3, L4 and L5, obtaining pixels with different receptive fields through dilated convolution, and using deformable convolution to learn a coordinate offset for each pixel so as to dynamically aggregate the strongly correlated surrounding feature values of each pixel, yielding features T1, T2, T3, T4 and T5; adding the dynamic-graph features T1, T2, T3, T4 and T5 to the point cloud encoding feature maps P1, P2, P3, P4 and P5 to obtain point cloud feature maps M1, M2, M3, M4 and M5 containing both semantic and geometric information;
generating a preliminary depth completion map with a residual-network-based U-shaped encoder-decoder;
inputting the preliminarily predicted completion depth map into a confidence-uncertainty estimation module to obtain the final optimized depth completion.
2. The scene depth completion method based on conditional variance self-encoder and geometric guidance of claim 1, wherein the acquiring of color image and sparse depth map in an autopilot scene comprises:
capturing a color image and a sparse depth map in an autopilot scene using a color camera and a lidar;
the sparse depth map is changed into a dense depth map by using a Sparsity Invariant CNNs algorithm to serve as a real tag to assist training.
3. The scene depth completion method based on conditional variance self-encoder and geometric guidance according to claim 2, wherein the designing the conditional variance self-encoder with a priori network and a posterior network, inputting the color image and sparse depth map into the a priori network to extract features, and then inputting the color image, sparse depth map and dense depth map into the posterior network to extract features, comprises:
based on a feature extraction module with a ResNet structure, designing a priori network and a posterior network with the same structure as the conditional variational auto-encoder;
inputting the color image and the sparse depth map into the priori network to extract the last-layer feature map Prior, and inputting the color image, the sparse depth map and the real label into the posterior network to extract the last-layer feature map Posterior; then computing the mean and variance of the Prior and Posterior feature maps respectively to obtain the probability distributions D1 and D2 of the respective features, and supervising the loss between the distributions D1 and D2 by using a Kullback-Leibler divergence loss function, so that the priori network learns the real-label features of the posterior network.
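The supervision step in claim 3 reduces each feature map to a probability distribution and penalises the Kullback-Leibler divergence between them. A minimal sketch, assuming each distribution is a diagonal Gaussian obtained from per-channel spatial statistics (the claim does not fix this exact reduction, so both the pooling and the function names are assumptions):

```python
import torch

def gaussian_params(feat):
    """Collapse a feature map (N, C, H, W) to a per-channel diagonal
    Gaussian: mean and variance over the spatial dimensions."""
    mu = feat.mean(dim=(2, 3))
    var = feat.var(dim=(2, 3)) + 1e-8  # guard against zero variance
    return mu, var

def kl_divergence(mu_p, var_p, mu_q, var_q):
    """KL(D1 || D2) for diagonal Gaussians, summed over channels and
    averaged over the batch; used to pull the prior network's
    distribution toward the posterior network's."""
    return 0.5 * (torch.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1).sum(dim=1).mean()
```

As a sanity check, the divergence of a distribution against itself is zero, and it grows as the prior's statistics drift from the posterior's.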
4. The scene depth completion method based on a conditional variational auto-encoder and geometric guidance according to claim 3, wherein the converting of the sparse depth map into a point cloud by using the camera intrinsic parameters, i.e. the optical center and focal lengths, the extracting of geometric spatial features with a point cloud up-sampling model, and the mapping back onto the sparse depth map, comprises:
converting the sparse depth image pixels (ui, vi) from the pixel coordinate system into the camera coordinate system to obtain the point cloud coordinates (xi, yi, zi), forming the sparse point cloud data S:

xi = (ui − cx)·di / fx,  yi = (vi − cy)·di / fy,  zi = di

wherein (cx, cy) is the optical center coordinate of the camera, fx and fy are the focal lengths of the camera along the x-axis and y-axis, and di is the depth value at position (ui, vi); for the real label depth map, a dense label point cloud S1 is also generated by using the above formula;
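The pinhole back-projection above can be sketched directly; `backproject` is an illustrative name, and the sketch assumes depth values of zero mark missing pixels:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a sparse depth map (H, W) to a camera-frame point
    cloud using the pinhole model: x=(u-cx)d/fx, y=(v-cy)d/fy, z=d.
    Only pixels with a positive depth produce a point."""
    v, u = np.nonzero(depth > 0)          # valid pixel coordinates
    d = depth[v, u]                       # their depth values d_i
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.stack([x, y, d], axis=1)    # (N, 3) point cloud S
```

Applying the same function to the real label depth map yields the dense label point cloud S1.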
randomly sampling the point cloud a plurality of times to obtain point cloud sets of different sizes; for each point cloud set, aggregating the 16 nearest points around each point by using the KNN nearest-neighbor algorithm, and inputting the 16 nearest points into a geometric perception neural network to extract the local geometric features of that point;
adding the sparse point cloud features extracted for each point to the original point cloud coordinates (xi, yi, zi) to obtain the point cloud coding feature Q; inputting Q into a four-fold up-sampling multi-layer perceptron network to obtain the predicted dense point cloud S2, and computing the loss between the true dense point cloud S1 and the predicted dense point cloud S2 by using the Chamfer Distance loss function; the specific calculation formula of the CD loss is:

L_CD(S1, S2) = Σ_{x∈S1} min_{y∈S2} ‖x − y‖² + Σ_{y∈S2} min_{x∈S1} ‖y − x‖²

wherein the first term represents the sum of the minimum distances from any point x in S1 to S2, and the second term represents the sum of the minimum distances from any point y in S2 to S1.
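The two terms of the CD loss can be sketched with a brute-force pairwise distance matrix; using the squared Euclidean distance is an assumption (a common Chamfer convention), since the claim does not fix the exact norm:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Chamfer Distance between point sets s1 (N, 3) and s2 (M, 3):
    for each point in one set, the squared distance to its nearest
    point in the other set, summed over both sets."""
    # (N, M) matrix of pairwise squared distances
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

The O(N·M) matrix is fine for a sketch; a KD-tree nearest-neighbour query would replace it at the point counts produced by four-fold up-sampling.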
5. The scene depth completion method based on a conditional variational auto-encoder and geometric guidance according to claim 1, wherein the generating of the preliminary depth completion map by using the residual-network-based U-shaped encoder-decoder comprises:
designing a corresponding multi-scale decoder structure according to the encoder structure, to form an encoder-decoder network with a U-Net structure;
inputting the feature maps L1, L2, L3, L4 and L5 generated by the RGB branch into the U-Net to predict the first coarse depth completion map Depth1 and the confidence map C1;
inputting the feature maps M1, M2, M3, M4 and M5 generated by the point cloud branch into the U-Net to predict the second coarse depth completion map Depth2 and the confidence map C2.
6. The scene depth completion method based on a conditional variational auto-encoder and geometric guidance according to claim 1, wherein the inputting of the preliminarily predicted completion depth map into the confidence uncertainty estimation module to realize the final depth completion optimization comprises:
adding the generated confidence maps C1 and C2 to obtain a feature map C, performing uncertainty prediction on the feature map C by using a Softmax function, and predicting the uncertainty proportions F1 and F2 of each confidence map pixel by pixel;
multiplying the uncertainty maps F1 and F2 by the coarse depth maps Depth1 and Depth2 respectively, to obtain the final optimized depth completion map.
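The two steps of claim 6 can be sketched as a per-pixel Softmax over the stacked confidence maps followed by a weighted sum of the two coarse completions. The exact way the Softmax is applied is an assumption (the claim only names the function), and `fuse_depths` is an illustrative name:

```python
import torch
import torch.nn.functional as F

def fuse_depths(depth1, depth2, conf1, conf2):
    """Sketch of the confidence-based fusion step: Softmax over the
    two confidence maps gives per-pixel weights F1 and F2 that sum
    to 1, and the final depth is the weighted sum of the two coarse
    depth completion maps."""
    w = F.softmax(torch.stack([conf1, conf2], dim=0), dim=0)  # F1, F2
    return w[0] * depth1 + w[1] * depth2
```

With equal confidences the fusion degenerates to a plain average; as one branch's confidence grows, its depth prediction dominates that pixel.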
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310422520.0A CN116468768B (en) | 2023-04-20 | 2023-04-20 | Scene depth completion method based on conditional variation self-encoder and geometric guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116468768A CN116468768A (en) | 2023-07-21 |
CN116468768B true CN116468768B (en) | 2023-10-17 |
Family
ID=87183885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310422520.0A Active CN116468768B (en) | 2023-04-20 | 2023-04-20 | Scene depth completion method based on conditional variation self-encoder and geometric guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116468768B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351310B (en) * | 2023-09-28 | 2024-03-12 | 山东大学 | Multi-mode 3D target detection method and system based on depth completion |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112767294A (en) * | 2021-01-14 | 2021-05-07 | Oppo广东移动通信有限公司 | Depth image enhancement method and device, electronic equipment and storage medium |
CN112861729A (en) * | 2021-02-08 | 2021-05-28 | 浙江大学 | Real-time depth completion method based on pseudo-depth map guidance |
WO2022045495A1 (en) * | 2020-08-25 | 2022-03-03 | Samsung Electronics Co., Ltd. | Methods for depth map reconstruction and electronic computing device for implementing the same |
CN114998406A (en) * | 2022-07-14 | 2022-09-02 | 武汉图科智能科技有限公司 | Self-supervision multi-view depth estimation method and device |
CN115423978A (en) * | 2022-08-30 | 2022-12-02 | 西北工业大学 | Image laser data fusion method based on deep learning and used for building reconstruction |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11315266B2 (en) * | 2019-12-16 | 2022-04-26 | Robert Bosch Gmbh | Self-supervised depth estimation method and system |
Non-Patent Citations (5)
Title |
---|
Depth Completion Auto-Encoder; Kaiyue Lu et al.; 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW); full text *
Sparse-to-Dense Multi-Encoder Shape Completion of Unstructured Point Cloud; Yanjun Peng et al.; IEEE Access, vol. 8; full text *
Unsupervised depth estimation model for tomato plant images based on dense auto-encoders; Zhou Yuncheng, Deng Hanbing, Xu Tongyu, Miao Teng, Wu Qiong; Transactions of the Chinese Society of Agricultural Engineering, 2020, No. 11; full text *
Depth image acquisition method based on fusion of vision and laser point clouds; Wang Dongmin, Peng Yongsheng, Li Yongle; Journal of Military Transportation University, 2017, No. 10; full text *
Research on robust and intelligent multi-source fusion SLAM technology; Zuo Xingxing; China Doctoral Dissertations Full-text Database (Information Science and Technology); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111563923B (en) | Method for obtaining dense depth map and related device | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
WO2019223382A1 (en) | Method for estimating monocular depth, apparatus and device therefor, and storage medium | |
CN110853075B (en) | Visual tracking positioning method based on dense point cloud and synthetic view | |
US11455806B2 (en) | System and method for free space estimation | |
CN116468768B (en) | Scene depth completion method based on conditional variation self-encoder and geometric guidance | |
CN113284173B (en) | End-to-end scene flow and pose joint learning method based on false laser radar | |
CN113421217A (en) | Method and device for detecting travelable area | |
CN113012191B (en) | Laser mileage calculation method based on point cloud multi-view projection graph | |
CN116740488B (en) | Training method and device for feature extraction model for visual positioning | |
CN112270701A (en) | Packet distance network-based parallax prediction method, system and storage medium | |
CN116703996A (en) | Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation | |
US20230377180A1 (en) | Systems and methods for neural implicit scene representation with dense, uncertainty-aware monocular depth constraints | |
CN116485892A (en) | Six-degree-of-freedom pose estimation method for weak texture object | |
CN116129234A (en) | Attention-based 4D millimeter wave radar and vision fusion method | |
KR102299902B1 (en) | Apparatus for providing augmented reality and method therefor | |
CN115239559A (en) | Depth map super-resolution method and system for fusion view synthesis | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
CN115222815A (en) | Obstacle distance detection method, obstacle distance detection device, computer device, and storage medium | |
CN114155406A (en) | Pose estimation method based on region-level feature fusion | |
US10896333B2 (en) | Method and device for aiding the navigation of a vehicle | |
Ren et al. | T-UNet: A novel TC-based point cloud super-resolution model for mechanical lidar | |
CN117422629B (en) | Instance-aware monocular semantic scene completion method, medium and device | |
WO2024042704A1 (en) | Learning device, image processing device, learning method, image processing method, and computer program | |
CN117523547B (en) | Three-dimensional scene semantic perception method, system, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||