CN115496788A - Depth completion method using spatial propagation post-processing module - Google Patents

Depth completion method using spatial propagation post-processing module

Info

Publication number
CN115496788A
Authority
CN
China
Prior art keywords
depth
post-processing module
map
propagation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211215408.1A
Other languages
Chinese (zh)
Inventor
颜成钢
杨智文
张杰华
李亮
陈楚翘
高宇涵
胡冀
孙垚棋
王鸿奎
朱尊杰
殷海兵
张继勇
李宗鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202211215408.1A
Publication of CN115496788A
Legal status: Withdrawn (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a depth completion method using a spatial propagation post-processing module. The method first determines a depth estimation network model; then determines a spatial propagation post-processing module; then trains the depth estimation network with the spatial propagation post-processing module attached; and finally performs depth completion with the trained model. Compared with a conventional monocular depth estimation and completion network, the method adds an extra post-processing stage, so the model makes fuller use of the accurate sparse depth information from the LiDAR, yielding a more accurate depth completion result.

Description

Depth completion method using spatial propagation post-processing module
Technical Field
The invention belongs to the field of computer vision and particularly relates to a depth completion method using a spatial propagation post-processing module, aimed at depth sensing systems.
Background
In recent years, with the rapid growth of computer vision applications, depth estimation from a single image, i.e., predicting the distance of each pixel from the camera, has become an important problem. It has wide application in fields such as augmented reality, unmanned aerial vehicle control, autonomous driving, and motion planning. To obtain reliable depth predictions, information from various sensors is used, such as RGB cameras, radar, LiDAR, and ultrasonic sensors. Depth sensors such as LiDAR produce accurate depth measurements at high frequency. However, due to hardware limitations such as the number of scan channels, the acquired depth is typically sparse. The task of estimating dense depth from given sparse depth values is referred to as depth completion.
An affinity matrix is a generic matrix that encodes the distance or similarity between two points in space. In computer vision tasks, it can be viewed as a weighted graph that treats each pixel as a node and connects each pair of pixels by an edge; the weight on an edge reflects the similarity of the pixel pair for the task at hand. For low-level vision tasks such as image filtering, the affinity values should capture the low-level coherence of color and texture; for mid- and high-level vision tasks such as image matting and segmentation, the affinity measure should preserve semantic-level pairwise similarity. The sparse depth acquired by a LiDAR sensor can be propagated to surrounding pixels using an affinity matrix. The spatial propagation post-processing module therefore post-processes the depth completion result by combining the learned affinity matrix with the LiDAR data, yielding a more accurate depth estimate.
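As a purely illustrative sketch of this idea (not the patented method, where the affinities are learned by the network rather than hand-crafted), the following builds 8-neighbor affinities from color similarity and uses them for one propagation step; all names and parameters are hypothetical.

```python
# Illustrative only: a hand-crafted 8-neighbor affinity built from color
# similarity, used for one linear propagation step on a depth map.
import numpy as np

def color_affinity_step(depth, rgb, sigma=10.0):
    """One propagation step where each pixel takes a weighted average of its
    8 neighbors, weighted by color similarity (hypothetical example)."""
    H, W = depth.shape
    out = depth.copy()
    offsets = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1) if (a, b) != (0, 0)]
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            weights, values = [], []
            for a, b in offsets:
                diff = rgb[i, j] - rgb[i - a, j - b]
                w = np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))  # color similarity
                weights.append(w)
                values.append(depth[i - a, j - b])
            weights = np.array(weights) / (np.sum(weights) + 1e-8)  # normalize
            out[i, j] = np.dot(weights, values)
    return out
```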
Disclosure of Invention
The present invention starts from the observation that, with the rapid growth of computer vision applications, depth sensing and estimation is critical in a wide range of engineering applications. However, existing depth sensors, including LiDAR, structured-light depth sensors, and stereo cameras, each have their own limitations. For example, top-of-the-line 3D LiDAR is costly (up to $75,000 per unit) and provides only sparse measurements for distant objects. Structured-light depth sensors (e.g., Kinect) are sensitive to sunlight, consume considerable power, and have short ranging distances. How to reduce the cost of depth estimation while improving its accuracy is therefore a problem worth addressing.
To address the deficiencies of the prior art, the invention provides a depth completion method using a spatial propagation post-processing module. For the monocular depth completion problem, the method optimizes the conventional monocular depth estimation network. To cooperate with the subsequent spatial propagation post-processing module, the depth estimation network is made to additionally output an affinity map (affinity matrix): the output of the conventional monocular depth estimation network is increased from 1 channel to 9 channels, where 1 channel serves as the depth map and the other 8 channels represent the affinity map. The spatial propagation post-processing module can then refine the estimate by combining the affinity information learned by the network with the sparse depth map. This post-processing makes the estimate more accurate, and because only an ordinary LiDAR is used to acquire sparse depth information, combined with an ordinary RGB camera, the cost of depth estimation can be effectively reduced.
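A minimal sketch of this 9-channel output convention, assuming a PyTorch batch × channel × height × width tensor layout; variable names are illustrative.

```python
import torch

# network_output: (B, 9, H, W) tensor produced by the depth estimation network
network_output = torch.randn(2, 9, 228, 304)

initial_depth = network_output[:, :1]   # 1 channel: the initial depth map
affinity_map  = network_output[:, 1:]   # 8 channels: affinities to the 8 neighbors
```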
A depth completion method using a spatial propagation post-processing module comprises the following specific steps:
Step 1: determine a depth estimation network model;
Step 2: determine a spatial propagation post-processing module;
Step 3: train the depth estimation network with the spatial propagation post-processing module attached;
Step 4: perform depth completion with the trained model.
Further, the specific method of step 1 is as follows.
The encoder uses a ResNet-50 neural network and the decoder uses standard upsampling; each residual block passes its feature values to the corresponding upsampling layer, i.e., skip connections are added. These two parts constitute a depth estimation network that receives the RGB map from the camera and the sparse depth map from the LiDAR, and outputs an initial depth map and an affinity map.
Further, the specific method of step 2 is as follows.
The spatial propagation post-processing module adopts a linear propagation scheme: according to the affinity map output by the network model determined in step 1, the depth value of each pixel in the initial depth map is propagated to its neighborhood, and finally the depth values at the corresponding positions are replaced by those of the sparse depth map. The propagation process is implemented with recursive convolution operations.
Given the depth map $D_0$ output by the depth estimation network, the convolution transformation with kernel size k applied at each iteration is

$$D_{i,j,t+1} = \kappa_{i,j}(0,0)\odot D_{i,j,t} + \sum_{a,b,\,(a,b)\neq(0,0)} \kappa_{i,j}(a,b)\odot D_{i-a,j-b,t}$$

where the transformation kernel $\hat{\kappa}_{i,j}$ is the output of the spatial propagation post-processing module and is spatially dependent on the input image; t denotes the t-th iteration; $\kappa_{i,j}$ is the normalized transformation kernel; i and j denote the i-th row and j-th column of the image; a and b denote the relative position within $\kappa_{i,j}$ (e.g., a = 0, b = 0 is the center of the kernel and a = 1, b = 1 is its lower-right corner); $D_{i,j,t+1}$ is the depth value at row i, column j of the depth map at iteration t+1; $D_{i-a,j-b,t}$ is the depth value at row i-a, column j-b of the depth map at iteration t; and $\odot$ denotes element-wise multiplication. The kernel size k is set to 3 here and is chosen odd so that the neighborhood around each pixel (i, j) is symmetric. The weights of the convolution kernel are given by the affinity map of step 1; to make the model converge stably, the kernel weights are normalized to the (-1, 1) interval. Then, to ensure that the post-processed depth keeps the same values at the pixels present in the sparse depth map, the following replacement is applied:

$$D_{i,j,t+1} = (1 - m_{i,j})\, D_{i,j,t+1} + m_{i,j}\, D^{s}_{i,j}$$

where $m_{i,j}$ is an indicator function that is 1 if the pixel was acquired by the LiDAR and 0 otherwise, and $D^{s}_{i,j}$ is the depth value at the corresponding position of the LiDAR-acquired sparse depth map.
Further, the specific method of step 3 is as follows.
The training platform is PyTorch. The NYU v2 indoor dataset and the KITTI outdoor dataset are used to train the depth estimation network with the spatial propagation post-processing module attached. The weights of the ResNet-50 network model are initialized from results pre-trained on the ImageNet dataset; the optimizer is stochastic gradient descent (SGD) with the batch size set to 12 and the number of iterations set to 40; the learning rate is initialized to 0.01 and reduced by 20% every 10 iteration cycles; the weight decay is set to 0.00001 for regularization; and 500 pixels are sampled from the raw data to simulate the LiDAR sampling for training.
Further, the specific method of step 4 is as follows.
The trained complete model receives two inputs: RGB video images from the camera and sparse depth information from the LiDAR. An initial dense depth map and the corresponding affinity map are first obtained through the depth estimation network, and the final dense depth map is then obtained through the spatial propagation post-processing module.
The invention has the following beneficial effects:
Compared with a conventional monocular depth estimation and completion network, the method adds an extra post-processing stage, so that the model makes fuller use of the accurate sparse depth information from the LiDAR, yielding a more accurate depth completion result.
Drawings
FIG. 1 is a diagram of an overall network model of the present invention;
FIG. 2 is a schematic representation of the spatial propagation operation of the post-processing portion of the present invention;
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The depth completion method using the spatial propagation post-processing network is implemented according to the following steps.
Step 1: determining the depth estimation neural network model
The encoder uses the classical ResNet-50 network model and the decoder uses standard upsampling; each residual block passes its feature values to the corresponding upsampling layer, i.e., skip connections are added. The overall model is shown in FIG. 1. In ResNet-50, the input first passes through a convolution operation and then through 4 different residual blocks, and classification is finally completed by an average pooling layer, a fully connected layer, and a classification function. Here the last fully connected layer and the classification part are removed and replaced by 4 upsampling layers. Upsampling is implemented with the bilinear interpolation of the PyTorch platform, and a skip connection is added between each pair of encoding and decoding layers.
The depth estimation network receives the RGB map from the camera and the sparse depth map from the LiDAR, and outputs an initial depth map and an affinity map.
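The following is a minimal sketch of such an encoder-decoder in PyTorch, assuming a torchvision ResNet-50 backbone; the way the sparse depth is fed into the network (here concatenated with the RGB image as a fourth input channel), the decoder channel widths, and all other unstated details are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DepthCompletionNet(nn.Module):
    """Sketch: ResNet-50 encoder, bilinear-upsampling decoder with skip
    connections, 9-channel output (1 depth + 8 affinity channels)."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)  # ImageNet weights in practice
        # Accept 4 input channels: RGB (3) + sparse depth (1) -- an assumption.
        self.stem = nn.Sequential(
            nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False),
            backbone.bn1, backbone.relu, backbone.maxpool)
        self.enc1, self.enc2 = backbone.layer1, backbone.layer2   # 256, 512 channels
        self.enc3, self.enc4 = backbone.layer3, backbone.layer4   # 1024, 2048 channels
        # Decoder: 4 upsampling stages, each fused with the encoder skip connection.
        self.dec4 = nn.Conv2d(2048 + 1024, 512, 3, padding=1)
        self.dec3 = nn.Conv2d(512 + 512, 256, 3, padding=1)
        self.dec2 = nn.Conv2d(256 + 256, 128, 3, padding=1)
        self.dec1 = nn.Conv2d(128, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 9, 1)    # 1 depth channel + 8 affinity channels

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)
        x0 = self.stem(x)
        x1 = self.enc1(x0); x2 = self.enc2(x1)
        x3 = self.enc3(x2); x4 = self.enc4(x3)
        up = lambda t, ref: F.interpolate(t, size=ref.shape[-2:], mode='bilinear',
                                          align_corners=False)
        d = F.relu(self.dec4(torch.cat([up(x4, x3), x3], dim=1)))
        d = F.relu(self.dec3(torch.cat([up(d, x2), x2], dim=1)))
        d = F.relu(self.dec2(torch.cat([up(d, x1), x1], dim=1)))
        d = F.relu(self.dec1(up(d, x)))     # back to the input resolution
        out = self.head(d)
        return out[:, :1], out[:, 1:]       # initial depth map, affinity map

depth, affinity = DepthCompletionNet()(torch.randn(1, 3, 228, 304),
                                        torch.randn(1, 1, 228, 304))
```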
Step 2: determining the spatial propagation post-processing module
The spatial propagation post-processing module adopts linear propagation: according to the affinity map output by the network model determined in step 1, the depth value of each pixel in the initial depth map is propagated (diffused) to its neighborhood, and finally the depth values at the corresponding positions are replaced by those of the sparse depth map. The propagation process is implemented with recursive convolution operations. Convolution is used in practice because it can be implemented efficiently through image vectorization and is therefore suitable for real-time depth estimation. As shown in FIG. 2, spatial propagation is similar to an anisotropic diffusion process, and the affinity map used for diffusion is learned by the depth estimation network model to guide the refinement of the output depth map. Given the depth map $D_0$ output by the depth estimation network, the convolution transformation with kernel size k applied at each iteration t is

$$D_{i,j,t+1} = \kappa_{i,j}(0,0)\odot D_{i,j,t} + \sum_{a,b,\,(a,b)\neq(0,0)} \kappa_{i,j}(a,b)\odot D_{i-a,j-b,t}$$

where the transformation kernel $\hat{\kappa}_{i,j}$ is the output of the spatial propagation module and is spatially dependent on the input image; $\kappa_{i,j}$ is the normalized transformation kernel; i and j denote the i-th row and j-th column of the image; a and b denote the relative position within $\kappa_{i,j}$ (e.g., a = 0, b = 0 is the center of the kernel and a = 1, b = 1 is its lower-right corner); and $\odot$ denotes element-wise multiplication. The kernel size k is set to 3 here and is chosen odd so that the context around each pixel (i, j) is symmetric. To make the model converge stably, the kernel weights are normalized to the (-1, 1) interval. Then, to ensure that the post-processed depth keeps the same values at the pixels present in the sparse depth map, the following replacement is applied:

$$D_{i,j,t+1} = (1 - m_{i,j})\, D_{i,j,t+1} + m_{i,j}\, D^{s}_{i,j}$$

where $m_{i,j}$ is an indicator function that is 1 if the pixel was acquired by the LiDAR and 0 otherwise, and $D^{s}$ is the sparse depth map acquired by the LiDAR.
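The following is a minimal PyTorch sketch of one such propagation iteration, assuming the 8 affinity channels are normalized by the sum of their absolute values (a common choice consistent with the (-1, 1) normalization described above); tensor layouts and names are assumptions, not a transcription of the patented implementation.

```python
import torch
import torch.nn.functional as F

def cspn_step(depth, affinity, sparse_depth, mask):
    """One spatial-propagation iteration.
    depth:        (B, 1, H, W) current depth estimate D_t
    affinity:     (B, 8, H, W) raw neighbor weights from the network
    sparse_depth: (B, 1, H, W) LiDAR sparse depth D^s
    mask:         (B, 1, H, W) m_{i,j}: 1 where a LiDAR measurement exists
    """
    # Normalize the 8 neighbor weights so their absolute values sum to (at most) 1,
    # and give the remaining weight to the center pixel (for stable convergence).
    kappa = affinity / (affinity.abs().sum(dim=1, keepdim=True) + 1e-8)
    center = 1.0 - kappa.sum(dim=1, keepdim=True)

    # Gather the 8 shifted copies of the depth map (k = 3 neighborhood).
    pad = F.pad(depth, (1, 1, 1, 1), mode='replicate')
    offsets = [(a, b) for a in (0, 1, 2) for b in (0, 1, 2) if (a, b) != (1, 1)]
    neighbors = torch.cat(
        [pad[:, :, a:a + depth.shape[2], b:b + depth.shape[3]] for a, b in offsets],
        dim=1)                                              # (B, 8, H, W)

    new_depth = center * depth + (kappa * neighbors).sum(dim=1, keepdim=True)
    # Replacement step: keep the LiDAR measurements where they exist.
    return (1 - mask) * new_depth + mask * sparse_depth

# Usage: iterate a fixed number of times starting from the network's D_0, e.g.
# for _ in range(num_iterations):
#     depth = cspn_step(depth, affinity, sparse_depth, mask)
```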
Step 3: model training
The training platform is PyTorch. The NYU v2 indoor dataset and the KITTI outdoor dataset are used to train the whole network model.
The NYU v2 dataset consists of RGB and depth images collected from 464 different indoor scenes. The official data split is used in this invention: 249 scenes are used for training, and 50K image samples are taken from the training set. Testing uses the 654 labeled images. The original image size is 640 × 480; images are downsampled to half size and center-cropped, and the final network input size is 304 × 228.
The KITTI dataset consists of 22 sequences measured by cameras and a LiDAR (a depth sensor that likewise produces sparse depth maps). Half of the sequences are used for training and the other half for testing. All 46K images in the training sequences are used for training, and 3200 images are randomly drawn from the test sequences for testing. In addition, since the top area of the images contains no depth, the invention uses the bottom 912 × 228 crop of each image as the network input.
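A small sketch of the input preprocessing described above; the exact crop placement (e.g., centering the crops horizontally) is an assumption.

```python
import torch
import torch.nn.functional as F

def preprocess_nyu(img):
    """NYU v2: 640x480 -> downsample to half size -> center crop to 304x228."""
    img = F.interpolate(img, scale_factor=0.5, mode='bilinear', align_corners=False)
    _, _, h, w = img.shape                       # 240 x 320 after downsampling
    top, left = (h - 228) // 2, (w - 304) // 2
    return img[:, :, top:top + 228, left:left + 304]

def preprocess_kitti(img):
    """KITTI: keep a bottom 912x228 region, where depth is available."""
    _, _, h, w = img.shape
    left = (w - 912) // 2                        # horizontal placement assumed
    return img[:, :, h - 228:h, left:left + 912]
```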
The weights of the ResNet-50 model in the encoder of the network are initialized from results pre-trained on the ImageNet dataset; the optimizer is stochastic gradient descent (SGD) with the batch size set to 12 and the number of iterations set to 40; the learning rate is initialized to 0.01 and reduced by 20% every 10 iteration cycles; the weight decay is set to 0.00001 for regularization; and 500 pixels are sampled from the raw data to simulate the LiDAR sampling for training.
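A minimal sketch of this training configuration, reusing the DepthCompletionNet and cspn_step sketches above; the momentum value, the loss function, and the data loader are assumptions, since the description does not specify them.

```python
import torch

model = DepthCompletionNet()          # encoder weights assumed ImageNet-initialized
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-5)   # momentum is assumed
# Reduce the learning rate by 20% every 10 iteration cycles.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)

def sample_sparse_depth(dense_depth, num_points=500):
    """Simulate LiDAR by keeping 500 random pixels of the ground-truth depth."""
    mask = torch.zeros_like(dense_depth)
    b, _, h, w = dense_depth.shape
    for i in range(b):
        idx = torch.randperm(h * w)[:num_points]
        mask.view(b, -1)[i, idx] = 1.0
    return dense_depth * mask, mask

# Training loop outline (data loader, loss, and iteration count are placeholders):
# for epoch in range(40):
#     for rgb, gt_depth in train_loader:
#         sparse, mask = sample_sparse_depth(gt_depth)
#         depth0, affinity = model(rgb, sparse)
#         for _ in range(num_iterations):
#             depth0 = cspn_step(depth0, affinity, sparse, mask)
#         loss = torch.nn.functional.l1_loss(depth0, gt_depth)   # assumed loss
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```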
Step 4: using the model
The trained complete model receives two inputs: RGB video images from the camera and sparse depth information from the LiDAR. An initial dense depth map and the corresponding affinity map are first obtained through the depth estimation network, and the final dense depth map is then obtained through the spatial propagation post-processing module.
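Putting the pieces together, a hedged end-to-end inference sketch reusing the earlier sketches; the number of propagation iterations is an assumption, as it is not stated in the description.

```python
import torch

@torch.no_grad()
def complete_depth(model, rgb, sparse_depth, num_iterations=24):
    """rgb: (B, 3, H, W); sparse_depth: (B, 1, H, W) with zeros where no LiDAR point."""
    mask = (sparse_depth > 0).float()
    depth, affinity = model(rgb, sparse_depth)       # initial depth + affinity map
    for _ in range(num_iterations):                  # spatial propagation post-processing
        depth = cspn_step(depth, affinity, sparse_depth, mask)
    return depth
```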

Claims (5)

1. A depth completion method using a spatial propagation post-processing module, characterized by comprising the following steps:
Step 1: determine a depth estimation network model;
Step 2: determine a spatial propagation post-processing module;
Step 3: train the depth estimation network with the spatial propagation post-processing module attached;
Step 4: perform depth completion with the trained model.
2. The depth completion method using a spatial propagation post-processing module according to claim 1, wherein the specific method of step 1 is as follows:
the encoder part uses a ResNet50 neural network, the decoder part uses standard upsampling, and each residual block needs to carry out numerical value transmission with a corresponding upsampling layer, namely skip connection is added; these two parts constitute a depth estimation network that receives the RGB map from the camera and the sparse depth map from the LiDAR and outputs an initial depth map and an affinity map.
3. The depth completion method using a spatial propagation post-processing module according to claim 2, wherein the specific method of step 2 is as follows:
the spatial domain propagation post-processing module adopts a linear propagation mode, propagates the depth value of each pixel in the initial depth map to the periphery according to the affinity map output by the network model determined in the step 1, and finally replaces the depth value of the corresponding position by the sparse depth map; and the process of propagation is realized by using a recursive convolution operation mode;
given the depth map $D_0$ output by the depth estimation network, the convolution transformation with kernel size k applied at each iteration is

$$D_{i,j,t+1} = \kappa_{i,j}(0,0)\odot D_{i,j,t} + \sum_{a,b,\,(a,b)\neq(0,0)} \kappa_{i,j}(a,b)\odot D_{i-a,j-b,t}$$

wherein the transformation kernel $\hat{\kappa}_{i,j}$ is the output of the spatial propagation post-processing module and is spatially dependent on the input image, t denotes the t-th iteration, $\kappa_{i,j}$ is the normalized transformation kernel, i and j denote the i-th row and j-th column of the image, a and b denote the relative position within $\kappa_{i,j}$, $D_{i,j,t+1}$ denotes the depth value at row i, column j of the depth map at iteration t+1, $D_{i-a,j-b,t}$ denotes the depth value at row i-a, column j-b of the depth map at iteration t, and $\odot$ denotes element-wise multiplication; the kernel size k is set to 3 here and is chosen odd so that the neighborhood around each pixel (i, j) is symmetric; the weights of the convolution kernel are given by the affinity map of step 1, and to make the model converge stably, the kernel weights are normalized to the (-1, 1) interval; then, to ensure that the post-processed depth keeps the same values at the pixels present in the sparse depth map, the following replacement is applied:

$$D_{i,j,t+1} = (1 - m_{i,j})\, D_{i,j,t+1} + m_{i,j}\, D^{s}_{i,j}$$

wherein $m_{i,j}$ is an indicator function used to determine whether the pixel was acquired by the LiDAR, taking the value 1 if so and 0 otherwise, and $D^{s}_{i,j}$ denotes the depth value at the corresponding position of the LiDAR-acquired sparse depth map.
4. The depth completion method using a spatial propagation post-processing module according to claim 3, wherein the specific method of step 3 is as follows:
the training platform adopts a Pythrch; respectively adopting an NYU v2 indoor data set and a KITTI outdoor data set to train the depth estimation network added with the airspace propagation post-processing module; wherein the weight of the ResNet-50 network model is initialized using results pre-trained on the ImageNet dataset; the optimizer adopts a random gradient descent SGD optimizer, the batch size is set to be 12, and the iteration number is set to be 40; the learning rate was initialized to 0.01 and reduced by 20% every 10 iteration cycles; weight decay is set to 0.00001 for regularization; 500 pixel points are collected on the original data to simulate the LiDAR sampling effect for training.
5. The depth completion method using a spatial propagation post-processing module according to claim 4, wherein the specific method of step 4 is as follows:
the trained complete model needs to receive two parts of input: RGB video images from a camera and sparse depth information from LiDAR; firstly, an initial dense depth map and a corresponding affinity map are obtained through a depth estimation network, and a final dense depth map is obtained through a spatial domain propagation post-processing module.
CN202211215408.1A 2022-09-30 2022-09-30 Depth completion method using spatial propagation post-processing module Withdrawn CN115496788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211215408.1A CN115496788A (en) 2022-09-30 2022-09-30 Depth completion method using spatial propagation post-processing module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211215408.1A CN115496788A (en) 2022-09-30 2022-09-30 Depth completion method using spatial propagation post-processing module

Publications (1)

Publication Number Publication Date
CN115496788A true CN115496788A (en) 2022-12-20

Family

ID=84473276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211215408.1A Withdrawn CN115496788A (en) 2022-09-30 2022-09-30 Deep completion method using airspace propagation post-processing module

Country Status (1)

Country Link
CN (1) CN115496788A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953029A (en) * 2024-03-27 2024-04-30 北京科技大学 General depth map completion method and device based on depth information propagation
CN117953029B (en) * 2024-03-27 2024-06-07 北京科技大学 General depth map completion method and device based on depth information propagation

Similar Documents

Publication Publication Date Title
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN111625608B (en) Method and system for generating electronic map according to remote sensing image based on GAN model
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN110910437B (en) Depth prediction method for complex indoor scene
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN110399820B (en) Visual recognition analysis method for roadside scene of highway
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN111797920B (en) Remote sensing extraction method and system for depth network impervious surface with gate control feature fusion
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111652273A (en) Deep learning-based RGB-D image classification method
CN115240079A (en) Multi-source remote sensing image depth feature fusion matching method
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN117132759A (en) Saliency target detection method based on multiband visual image perception and fusion
CN115496788A (en) 2022-12-20 Depth completion method using spatial propagation post-processing module
CN117809016A (en) Cloud layer polarization removal orientation method based on deep learning
CN118212127A (en) Misregistration-based physical instruction generation type hyperspectral super-resolution countermeasure method
CN117036442A (en) Robust monocular depth completion method, system and storage medium
CN115797684A (en) Infrared small target detection method and system based on context information
CN115482257A (en) Motion estimation method integrating deep learning characteristic optical flow and binocular vision
CN113435243B (en) Hyperspectral true downsampling fuzzy kernel estimation method
CN109697695A (en) The ultra-low resolution thermal infrared images interpolation algorithm of visible images guidance
CN115223033A (en) Synthetic aperture sonar image target classification method and system
CN114091519A (en) Shielded pedestrian re-identification method based on multi-granularity shielding perception
CN116958800A (en) Remote sensing image change detection method based on hierarchical attention residual unet++

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20221220