CN116862965A

CN116862965A - Depth completion method based on sparse representation

Info

Publication number: CN116862965A
Application number: CN202310836476.8A
Authority: CN
Inventors: 杨敬钰; 董津铭; 岳焕景; 李坤
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2023-07-08
Filing date: 2023-07-08
Publication date: 2023-10-10

Abstract

The invention discloses a depth completion method based on sparse representation and belongs to the technical field of image processing. The invention designs an adaptive sampling scheme that captures important sampling points, which helps the network reconstruct a denser depth map. The method comprises the following steps: S1, passing an RGB image through a sampling network to output an uncertainty map; S2, obtaining a sampled sparse depth map through an uncertainty-based sampling process; S3, building a reconstruction neural network, inputting the RGB image and the sampled sparse depth map into the reconstruction network for training, and recovering a dense depth map.

Description

Depth completion method based on sparse representation

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a depth completion method based on sparse representation.

Background

In recent years, with the continued development of computer vision, obtaining reliable depth information has become important for fields such as autonomous driving, robotics, and augmented reality. In real scenes, depth information is usually acquired with an RGB-D depth camera or a LiDAR. For LiDAR, because of hardware limitations, the depth map obtained by projecting the scanned point cloud onto the image plane is sparse, whereas real applications need denser depth maps in order to perceive the environment. The process of deriving a dense depth map from a sparse depth map is called depth completion.

With the rapid development of deep learning, many researchers have studied the depth completion task in recent years. Existing work falls mainly into two categories: methods that take only the sparse depth map as network input and reconstruct a dense depth map, and methods that additionally use the rich semantic information of the RGB image as guidance to complete the sparse depth map. In practical applications, RGB-guided depth completion is the more popular approach. Typically, the RGB image and the sparse depth map are fed directly into the network; if a good reconstruction can be obtained from relatively few sampling points, cost is saved and cost-effectiveness improves. Previous researchers have usually sampled the ground-truth depth map at random to obtain sparse samples, but samples obtained this way do not reflect the edge information of objects in the ground-truth depth map well, so the reconstruction results have some flaws.

In summary, the invention provides a depth completion method based on sparse representation.

Disclosure of Invention

The invention aims to provide a depth completion method based on sparse representation to solve the problems described in the background.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A depth completion method based on sparse representation specifically comprises the following steps:

S1, selecting RGB images from several public depth completion datasets, and inputting the RGB images into a sampling network to output an uncertainty map;

S2, obtaining a sampled sparse depth map through a sampling process based on the uncertainty map;

S3, designing a reconstruction neural network structure, building an image reconstruction network, inputting the RGB image together with the sampled sparse depth map obtained in S2 into the reconstruction network for training, and recovering a dense depth map.

Preferably, the sampling network in S1 adopts a U-Net-style structure and consists of an encoder and a decoder;

the encoder comprises four residual modules, each consisting of a feature extraction and downsampling module and a feature holding module; the feature extraction and downsampling module extracts high-dimensional features of the image and downsamples it, and the feature holding module further deepens the feature map without losing resolution;

the decoder consists of four receptive-field-guided upsampling modules;

skip connections are arranged between the encoder and the decoder to further promote feature fusion, and an uncertainty map is finally output.
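As a rough illustration of the encoder-decoder layout described above, the following sketch traces feature-map sizes through four encoder stages and four decoder stages. It assumes that each encoder module halves the spatial resolution and each decoder module doubles it; the text does not state the strides, and `feature_map_sizes` is a hypothetical helper name.

```python
def feature_map_sizes(height, width, num_stages=4):
    """Trace (height, width) through a U-Net-style encoder and decoder.

    Assumption: every encoder stage halves the resolution and every
    decoder stage doubles it. The source text does not give the strides.
    """
    encoder = []
    h, w = height, width
    for _ in range(num_stages):
        h, w = h // 2, w // 2      # downsampling stage
        encoder.append((h, w))
    decoder = []
    for _ in range(num_stages):
        h, w = h * 2, w * 2        # upsampling stage
        decoder.append((h, w))
    return encoder, decoder


# Example with a 320 x 240 input (width x height).
enc, dec = feature_map_sizes(240, 320)
print(enc)
print(dec)
```

Under this assumption the last decoder stage returns to the input resolution, so the uncertainty map has one value per input pixel, and each decoder output has the same size as the encoder feature map that a skip connection would fuse with it.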

Preferably, the uncertainty map encodes the sampling logic: high uncertainty corresponds to a low sampling probability; based on this, S2 specifically includes the following:

S2.1, assuming that the size of the depth map D is m × n, the elements of the binarized sampling mask M of the depth map D are defined as:

$$M_{i,j} = \begin{cases} 1, & \text{with probability } p_{i,j} \\ 0, & \text{with probability } 1 - p_{i,j} \end{cases}$$

where $p_{i,j}$ denotes the sampling probability of the pixel at position (i, j);

the sampling process of the depth map D is defined as:

$$S = D \cdot M$$

where S denotes the sparse depth map obtained by sampling the depth map D, and $\cdot$ denotes the pixel-wise product;

S2.2, assuming that the uncertainty map P also has size m × n, a binary sampling mask M' is generated from the uncertainty map following S2.1: first, a random matrix R of size m × n is generated whose elements are random values between 0 and 1, denoted $r_{i,j} \in [0,1]$;

S2.3, the sampling probability of the pixel corresponding to each element of the uncertainty map is denoted $p'_{i,j}$, with $p'_{i,j} \in [0,1]$; $r_{i,j}$ is compared with $p'_{i,j}$, and the binary mask position is set to 1 where the condition $r_{i,j} \le p'_{i,j}$ holds and to 0 otherwise, expressed as:

$$M'_{i,j} = \begin{cases} 1, & r_{i,j} \le p'_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

S2.4, based on S2.1 to S2.3, the sampling process of the uncertainty map P is defined as:

$$S' = P \cdot M'$$

where S' denotes the sparse depth map obtained after the uncertainty map P is sampled, and $\cdot$ denotes the pixel-wise product.
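The mask generation in S2.2 and S2.3 and the pixel-wise product can be sketched in a few lines of plain Python. This is a minimal illustration rather than the patented implementation: the probability and depth values are invented, and `sample_mask` and `apply_mask` are hypothetical helper names.

```python
import random


def sample_mask(prob_map, rng):
    """Binary mask: 1 where a uniform random r_ij <= p'_ij, else 0."""
    return [[1 if rng.random() <= p else 0 for p in row] for row in prob_map]


def apply_mask(values, mask):
    """Pixel-wise product of a 2-D map with a binary mask."""
    return [[v * m for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(values, mask)]


rng = random.Random(0)                           # fixed seed for repeatability
prob_map = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]   # hypothetical p'_ij values
depth = [[1.2, 3.4, 2.0], [0.8, 5.1, 4.4]]       # hypothetical depth values
mask = sample_mask(prob_map, rng)
sparse = apply_mask(depth, mask)
print(mask)
print(sparse)
```

Because each pixel is kept independently with probability $p'_{i,j}$, the expected number of sampled pixels equals the sum of the probabilities over the map, which is how a probability map can control the sampling budget.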

Preferably, the reconstruction network in S3 consists of a multi-scale convolution module, a parallel dual-stream encoder-decoder structure, and an adaptive fusion mechanism, and the reconstruction network reconstructs the dense depth map through cross-channel information interaction;

the loss functions of the sampling part used when training the reconstruction network are designed as follows:

(1) L2 loss function $l_{prob}$: $l_{prob}$ is used to supervise the generated uncertainty map, and is expressed as:

where the gradient operator symbol in the formula denotes the Sobel gradient operation;

(2) regularization loss function $l_{reg}$: $l_{reg}$ is used to constrain the training process, and is expressed as:

where N denotes the total number of available depth image pixels, and s denotes the number of sampling points;

(3) total loss function of the sampling part $l_{sample}$, expressed as:

$$l_{sample} = l_{prob} + \alpha\, l_{reg}$$

where α denotes a weighting coefficient;

the loss functions of the reconstruction part used when training the reconstruction network are designed as follows:

(1) gradient loss term $l_{grad}$, based on the L1 loss between the reconstructed depth map $D^*$ and the ground-truth depth map D: $l_{grad}$ is used to reduce the error in the computed depth gradients, and is expressed as:

(2) surface normal loss function $l_{norm}$: $l_{norm}$ is used to further refine local details, and is expressed as:

where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors;

(3) total loss function of the reconstruction part $l_{rec}$, expressed as:

$$l_{rec} = w_1\left[l_1(D^*_{low}, D) + l_{grad}(D^*_{low}, D) + l_{norm}(D^*_{low}, D)\right] + w_2\, l_1(D^*_{up}, D) + l_1(D^*, D) + w_3\, l_{grad}(D^*, D) + w_4\, l_{norm}(D^*, D)$$

where $w_1$, $w_2$, $w_3$, and $w_4$ denote the weighting coefficients of the different parts;

combining the loss functions of the sampling part and the reconstruction part, the total loss function $l_{final}$ of the reconstruction network is:

$$l_{final} = l_{rec} + \beta\, l_{sample}$$

where β denotes a weighting coefficient.
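The weighted sums that combine the individual loss terms can be written out as a short sketch. The individual terms are passed in as already-computed scalars, since their formulas are not reproduced in this text. The reconstruction sum follows the expression in claim 4, with the final $w_4$ term taken to be the surface normal loss, which is an assumption. All function names and numeric values are hypothetical.

```python
def sampling_loss(l_prob, l_reg, alpha):
    """l_sample = l_prob + alpha * l_reg."""
    return l_prob + alpha * l_reg


def reconstruction_loss(low, up, full, w1, w2, w3, w4):
    """Weighted sum over the low-resolution, upsampled and final outputs.

    `low`, `up` and `full` are dicts of scalar loss values keyed by
    'l1', 'grad' and 'norm' (only the keys used below are required).
    """
    return (w1 * (low["l1"] + low["grad"] + low["norm"])
            + w2 * up["l1"]
            + full["l1"]
            + w3 * full["grad"]
            + w4 * full["norm"])   # assumption: w4 weights the normal loss


def final_loss(l_rec, l_sample, beta):
    """l_final = l_rec + beta * l_sample."""
    return l_rec + beta * l_sample


# Hypothetical scalar values, chosen only to make the arithmetic easy to follow.
l_sample = sampling_loss(l_prob=0.5, l_reg=0.25, alpha=2.0)
l_rec = reconstruction_loss(
    low={"l1": 1.0, "grad": 2.0, "norm": 3.0},
    up={"l1": 4.0},
    full={"l1": 5.0, "grad": 6.0, "norm": 7.0},
    w1=0.5, w2=0.25, w3=2.0, w4=1.0,
)
l_final = final_loss(l_rec, l_sample, beta=0.5)
print(l_sample, l_rec, l_final)
```

In a training loop the same sums would be applied to tensors rather than Python floats, so that gradients flow to both the sampling network and the reconstruction network through $l_{final}$.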

Compared with the prior art, the depth completion method based on sparse representation provided by the invention has the following beneficial effects:

(1) The invention substantially improves the sampling strategy: whereas conventional random sampling cannot effectively capture the depth information at object edges, the invention makes effective use of sampling points in edge regions.

(2) The method provided by the invention reconstructs a denser depth map, which opens up a new approach to reducing equipment cost in practical applications.

Drawings

FIG. 1 is an overall block diagram of the depth completion method based on sparse representation;

FIG. 2 shows example results of Example 1 of the present invention; the columns show, from left to right, the input RGB image, the randomly sampled points, the uncertainty map, the adaptively sampled points, the reconstructed dense depth map, and the original ground-truth depth map.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention.

The invention uses the indoor dataset NYUDepthV2 and the outdoor dataset KITTI as experimental datasets. The NYUDepthV2 dataset consists of video sequences of indoor scenes captured with the RGB-D camera of a Microsoft Kinect, and comprises about 50,000 indoor RGB-D image pairs collected in 464 different indoor scenes. The invention uses the official dataset split, with 249 scenes for training and the remaining 215 for testing. The RGB-D image pairs are first downsampled from the original 640 × 480 resolution to 320 × 240. Since the borders of the original depth map contain no measurements, only the 304 × 228 center-cropped region is evaluated.

In the KITTI dataset, the training set contains 85,898 RGB-D image pairs, the validation set contains 1,000 RGB-D image pairs, and another 1,000 frames are used as the test set. The dataset provides RGB images and aligned sparse depth maps obtained by projecting 3D LiDAR points onto the corresponding image frames; the color and depth images have the same resolution of 352 × 1216. In the raw 64-line LiDAR depth map about 5% of the pixels are valid, and in the semi-dense ground-truth depth map about 15% of the pixels are valid. Since the upper border of the depth map contains invalid pixels, the images are cropped to 256 × 1216 for training and testing. Specific examples are as follows.
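The crop sizes quoted above can be turned into pixel coordinates with a small sketch. The center crop follows directly from the stated sizes. The bottom crop for KITTI is an assumption based on the remark that the invalid pixels lie at the upper border; the text does not say exactly which rows are kept. The helper names are hypothetical.

```python
def center_crop_box(height, width, crop_h, crop_w):
    """Return (top, left, bottom, right) of a centered crop window."""
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w


def bottom_crop_box(height, width, crop_h):
    """Return a full-width window keeping the bottom `crop_h` rows.

    Assumption: the invalid rows at the top are discarded.
    """
    return height - crop_h, 0, height, width


def crop(image, box):
    """Crop a 2-D list (rows of pixels) to the given window."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


nyu_box = center_crop_box(240, 320, 228, 304)   # 320 x 240 -> 304 x 228
kitti_box = bottom_crop_box(352, 1216, 256)     # 352 x 1216 -> 256 x 1216
print(nyu_box, kitti_box)

# Tiny usage example on a 4 x 4 image of pixel indices.
tiny = [[r * 4 + c for c in range(4)] for r in range(4)]
print(crop(tiny, center_crop_box(4, 4, 2, 2)))
```

For the NYUDepthV2 sizes this removes 6 rows at the top and bottom and 8 columns at the left and right.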

Example 1:

Referring to FIGS. 1-2, the invention provides a depth completion method based on sparse representation, which specifically comprises the following:

the sampling network adopts a U-Net-style structure and consists of an encoder and a decoder;

the encoder comprises four residual modules, each consisting of a feature extraction and downsampling module and a feature holding module; the feature extraction and downsampling module extracts high-dimensional features of the image and downsamples it, and the feature holding module further deepens the feature map without losing resolution; the modules mainly use 1×1 and 3×3 convolutions together with the corresponding batch normalization and ReLU activation functions;

the decoder consists of four receptive-field-guided upsampling modules;

skip connections are arranged between the encoder and the decoder to further promote feature fusion, and an uncertainty map is finally output;

the uncertainty map encodes the sampling logic: high uncertainty corresponds to a low sampling probability; based on this, S2 specifically includes the following:

the sampling process of the depth map D is defined as:

$$S = D \cdot M$$

S2.3, the sampling probability of the pixel corresponding to each element of the uncertainty map is denoted $p'_{i,j}$, with $p'_{i,j} \in [0,1]$; $r_{i,j}$ is compared with $p'_{i,j}$, and the binary mask position is set to 1 where the condition $r_{i,j} \le p'_{i,j}$ holds and to 0 otherwise, expressed as:

$$M'_{i,j} = \begin{cases} 1, & r_{i,j} \le p'_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

S2.4, based on S2.1 to S2.3, the sampling process of the uncertainty map P is defined as:

$$S' = P \cdot M'$$

where S' denotes the sparse depth map obtained after the uncertainty map P is sampled, and $\cdot$ denotes the pixel-wise product;

S3, designing a reconstruction neural network structure, building an image reconstruction network, inputting the RGB image together with the sampled sparse depth map obtained in S2 into the reconstruction network for training, and recovering a dense depth map. The multi-scale convolution module is formed mainly by fusing convolution kernels of sizes 1 and 3, and combines element-wise addition with concatenation along the channel direction so that features are extracted more effectively. The encoder of the parallel dual-stream encoder-decoder network is mainly a pre-trained MobileNetV3-Large, whose main advantages are that it is lightweight and still extracts effective features; the decoder again consists of receptive-field-guided upsampling modules. A channel attention mechanism is used between the encoder and the decoder to promote further feature fusion. Finally, the feature weights of the two streams are computed through a sigmoid activation function, and the reconstructed depth map is obtained.

The sampled sparse depth map is concatenated with the RGB image along the channel dimension, forming a 4-channel image that is input to the reconstruction network;

the reconstruction network consists of a multi-scale convolution module, a parallel dual-stream encoder-decoder structure, and an adaptive fusion mechanism, and the reconstruction network reconstructs the dense depth map through cross-channel information interaction;
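Two of the data-flow steps described here, building the 4-channel input and blending two stream outputs with sigmoid weights, can be sketched in plain Python. The channel-last nested-list layout is chosen only for readability (a PyTorch model would normally use channel-first tensors). The blending rule, with weight w for one stream and 1 - w for the other, is an assumed form of the adaptive fusion, which the text does not spell out. All names and values are hypothetical.

```python
import math


def concat_rgbd(rgb, sparse_depth):
    """Stack an H x W x 3 image and an H x W depth map into H x W x 4."""
    return [[list(pixel) + [d] for pixel, d in zip(rgb_row, d_row)]
            for rgb_row, d_row in zip(rgb, sparse_depth)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_streams(depth_a, depth_b, logits):
    """Per-pixel blend of two depth predictions.

    Assumption: w = sigmoid(logit) weights stream A and (1 - w) stream B.
    """
    return [[sigmoid(z) * a + (1.0 - sigmoid(z)) * b
             for a, b, z in zip(row_a, row_b, row_z)]
            for row_a, row_b, row_z in zip(depth_a, depth_b, logits)]


rgb = [[(255, 0, 0), (0, 255, 0)]]     # hypothetical 1 x 2 colour image
sparse_depth = [[1.5, 0.0]]            # 0.0 marks a pixel that was not sampled
rgbd = concat_rgbd(rgb, sparse_depth)
fused = fuse_streams([[2.0]], [[4.0]], [[0.0]])
print(rgbd)
print(fused)
```

With a logit of 0 the two streams receive equal weight, so the fused value is their average; larger logits shift the result toward the first stream.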

(1) L2 loss function $l_{prob}$: $l_{prob}$ is used to supervise the generated uncertainty map, and is expressed as:

where the gradient operator symbol in the formula denotes the Sobel gradient operation;

(2) regularization loss function $l_{reg}$: $l_{reg}$ is used to constrain the training process, and is expressed as:

(3) total loss function of the sampling part $l_{sample}$, expressed as:

$$l_{sample} = l_{prob} + \alpha\, l_{reg}$$

where α denotes a weighting coefficient;

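(1) gradient loss term $l_{grad}$, based on the L1 loss between the reconstructed depth map $D^*$ and the ground-truth depth map D: $l_{grad}$ is used to reduce the error in the computed depth gradients, and is expressed as:

(2) surface normal loss function $l_{norm}$: $l_{norm}$ is used to further refine local details, and is expressed as:

where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors;

(3) total loss function of the reconstruction part $l_{rec}$, expressed as:

$$l_{rec} = w_1\left[l_1(D^*_{low}, D) + l_{grad}(D^*_{low}, D) + l_{norm}(D^*_{low}, D)\right] + w_2\, l_1(D^*_{up}, D) + l_1(D^*, D) + w_3\, l_{grad}(D^*, D) + w_4\, l_{norm}(D^*, D)$$

combining the loss functions of the sampling part and the reconstruction part, the total loss function $l_{final}$ of the reconstruction network is:

$$l_{final} = l_{rec} + \beta\, l_{sample}$$

where β denotes a weighting coefficient.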
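Since both the uncertainty supervision and the gradient loss refer to image gradients, a small sketch of a Sobel gradient and an L1 gradient loss may help. The formulas themselves are not reproduced in this text, so the loss below is one common form (the mean absolute difference of horizontal and vertical Sobel responses) and should be read as an assumption, not as the patented definition.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(image, kernel):
    """Apply a 3 x 3 kernel to the interior pixels of a 2-D list."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i - 1][j - 1] = sum(
                kernel[a][b] * image[i - 1 + a][j - 1 + b]
                for a in range(3) for b in range(3))
    return out


def l1_gradient_loss(pred, target):
    """Mean absolute difference of Sobel gradients (assumed loss form)."""
    total, count = 0.0, 0
    for kernel in (SOBEL_X, SOBEL_Y):
        grad_pred = sobel(pred, kernel)
        grad_target = sobel(target, kernel)
        for row_p, row_t in zip(grad_pred, grad_target):
            for gp, gt in zip(row_p, row_t):
                total += abs(gp - gt)
                count += 1
    return total / count


# A horizontal ramp: depth grows by 1 per column, constant along rows.
ramp = [[float(c) for c in range(4)] for _ in range(3)]
flat = [[0.0] * 4 for _ in range(3)]
print(sobel(ramp, SOBEL_X))
print(l1_gradient_loss(ramp, flat))
```

On the ramp the horizontal Sobel response is constant and the vertical response is zero, which makes the loss against a flat map easy to check by hand.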

The invention uses the Adam optimizer with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and sets the weight decay to $10^{-5}$. The initial learning rate is set to 0.001. The model is trained with the deep learning framework PyTorch for a total of 20 epochs on the whole training set, and every 2 epochs the learning rate is reduced to 80% of its previous value. In addition, to prevent overfitting while improving overall model performance, two data augmentation strategies are used to increase the diversity of the training data:

random horizontal flip: the color image and the depth image are both flipped horizontally with 50% probability;

random channel swap: the three RGB channels of the color image are randomly permuted with 50% probability.
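The learning-rate schedule and the two augmentations can be sketched without any framework. The schedule assumes 0-based epoch indices with the first reduction applied at epoch 2, which is one reading of the description; in PyTorch the same decay is usually expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.8)`. The helper names and the tiny test images are hypothetical.

```python
import random


def learning_rate(epoch, base_lr=0.001, decay=0.8, step=2):
    """Learning rate for a 0-based epoch: multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


def random_horizontal_flip(rgb, depth, rng, p=0.5):
    """Flip the colour image and depth map together so they stay aligned."""
    if rng.random() < p:
        rgb = [row[::-1] for row in rgb]
        depth = [row[::-1] for row in depth]
    return rgb, depth


def random_channel_swap(rgb, rng, p=0.5):
    """Permute the three colour channels of every pixel with probability p."""
    if rng.random() < p:
        order = [0, 1, 2]
        rng.shuffle(order)
        rgb = [[tuple(pixel[c] for c in order) for pixel in row] for row in rgb]
    return rgb


rng = random.Random(0)
schedule = [learning_rate(epoch) for epoch in range(20)]
flipped_rgb, flipped_depth = random_horizontal_flip(
    [[(1, 2, 3), (4, 5, 6)]], [[0.5, 1.5]], rng, p=1.0)   # p=1.0 forces the flip
swapped = random_channel_swap([[(10, 20, 30)]], rng, p=1.0)
print(schedule[0], schedule[2], schedule[19])
print(flipped_rgb, flipped_depth, swapped)
```

The flip is applied to the image and the depth map with a single random draw; flipping them independently would break the pixel correspondence between color and depth.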

Example 2:

Based on Example 1, with the following differences:

the invention selects 10 state-of-the-art comparison methods and trains them on the KITTI dataset: ACMNet, Sparse-to-Dense, CSPN, DeepLIDAR, NConv-CNN, MSG-CHN, GuideNet, Uncertainty, PENet, and AdaptiveLIDAR. Results for 256 and 512 sampling points are given in Tables 1 and 2, respectively.

Table 1 Comparison of results with 256 sampling points

Table 2 Comparison of results with 512 sampling points

Method	RMSE	MAE	iRMSE	iMAE	REL	δ_1.25
ACMNet	3417.34	1413.75	15.54	7.91	0.086	90.4
Sparse-to-Dense	2151.12	659.15	4.41	2.29	0.033	98.2
CSPN	1828.93	506.39	4.04	2.16	0.023	99.0
DeepLIDAR	1735.29	543.62	3.96	1.97	0.025	98.9
NConv-CNN	1973.10	508.64	4.27	2.14	0.025	98.7
MSG-CHN	1862.38	591.95	4.13	2.14	0.029	98.6
GuideNet	1787.59	554.37	3.98	2.06	0.028	99.0
Uncertainty	1771.60	568.18	4.08	2.16	0.026	98.8
PENet	1842.54	597.16	4.31	2.29	0.030	98.5
AdaptiveLIDAR	1789.41	590.62	3.92	1.89	0.027	98.7
Ours	1346.79	446.57	4.70	2.15	0.025	99.2

Tables 1 and 2 give quantitative comparisons on the RMSE, MAE, iRMSE, iMAE, REL, and $\delta_{1.25}$ metrics, where smaller is better for RMSE, MAE, iRMSE, iMAE, and REL, and larger is better for $\delta_{1.25}$. As the tables show, the method of the invention achieves the best results on the important metrics RMSE, MAE, and $\delta_{1.25}$.
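For readers unfamiliar with these metrics, the sketch below computes them using their commonly used definitions; the text itself does not define them, so the formulas are an assumption about what is being reported. Only pixels with valid ground truth (greater than 0) are evaluated, predictions are assumed positive, and $\delta_{1.25}$ is returned as a fraction rather than a percentage. The depth values are invented.

```python
import math


def depth_metrics(pred, gt):
    """Common depth-completion metrics over pixels with valid ground truth.

    `pred` and `gt` are flat lists of depths; gt <= 0 marks invalid pixels.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    return {
        "RMSE": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "MAE": sum(abs(p - g) for p, g in pairs) / n,
        "iRMSE": math.sqrt(sum((1 / p - 1 / g) ** 2 for p, g in pairs) / n),
        "iMAE": sum(abs(1 / p - 1 / g) for p, g in pairs) / n,
        "REL": sum(abs(p - g) / g for p, g in pairs) / n,
        "delta_1.25": sum(1 for p, g in pairs
                          if max(p / g, g / p) < 1.25) / n,
    }


# Hypothetical values; the third pixel has no ground truth and is ignored.
metrics = depth_metrics(pred=[2.0, 3.0, 5.0], gt=[2.0, 4.0, 0.0])
for name, value in metrics.items():
    print(name, value)
```

RMSE penalizes large errors more heavily than MAE, the inverse-depth variants emphasize errors on nearby surfaces, and $\delta_{1.25}$ counts the share of pixels whose predicted-to-true depth ratio stays within a factor of 1.25.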

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A depth completion method based on sparse representation, characterized by comprising the following steps:

2. The depth completion method based on sparse representation according to claim 1, wherein the sampling network in S1 adopts a U-Net-style structure and consists of an encoder and a decoder;

the decoder consists of four receptive-field-guided upsampling modules;

3. The depth completion method based on sparse representation according to claim 1, wherein the uncertainty map encodes the sampling logic: high uncertainty corresponds to a low sampling probability; based on this, S2 specifically includes the following:

the sampling process of the depth map D is defined as:

$$S = D \cdot M$$

$$S' = P \cdot M'$$

4. The depth completion method based on sparse representation according to claim 1, wherein the reconstruction network in S3 consists of a multi-scale convolution module, a parallel dual-stream encoder-decoder structure, and an adaptive fusion mechanism, and the reconstruction network reconstructs the dense depth map through cross-channel information interaction;

where the gradient operator symbol in the formula denotes the Sobel gradient operation;

$$l_{sample} = l_{prob} + \alpha\, l_{reg}$$

where α denotes a weighting coefficient;

where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors;

$$l_{rec} = w_1\left[l_1(D^*_{low}, D) + l_{grad}(D^*_{low}, D) + l_{norm}(D^*_{low}, D)\right] + w_2\, l_1(D^*_{up}, D) + l_1(D^*, D) + w_3\, l_{grad}(D^*, D) + w_4\, l_{norm}(D^*, D)$$

$$l_{final} = l_{rec} + \beta\, l_{sample}$$

where β denotes a weighting coefficient.