CN109685842B - Sparse depth densification method based on multi-scale network - Google Patents

Sparse depth densification method based on multi-scale network

Info

Publication number
CN109685842B
Authority
CN
China
Prior art keywords
layer
convolution
input
block
channels
Prior art date
Legal status
Active
Application number
CN201811531022.5A
Other languages
Chinese (zh)
Other versions
CN109685842A (en)
Inventor
刘光辉
朱志鹏
孙铁成
李茹
徐增荣
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201811531022.5A
Publication of CN109685842A
Application granted
Publication of CN109685842B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a sparse depth densification method based on a multi-scale network, belonging to the technical field of depth estimation in computer vision. The method uses a multi-scale convolutional neural network to effectively fuse RGB image data with sparse point cloud data and finally obtains a dense depth image. The sparse point cloud is mapped onto a two-dimensional plane to generate a sparse depth map, which is aligned with the RGB image and concatenated with it to form an RGBD image; the RGBD image is fed into the multi-scale convolutional neural network for training and testing, and a dense depth map is finally estimated. Estimating depth from the combination of the RGB image and the sparse point cloud lets the distance information contained in the point cloud guide the conversion of the RGB image into a depth map. The multi-scale network exploits information from the original data at different resolutions: on the one hand it enlarges the receptive field, and on the other hand the input depth map at the smaller resolution is denser, so higher accuracy can be obtained.

Description

Sparse depth densification method based on multi-scale network
Technical Field
The invention belongs to the field of depth estimation of computer vision, and particularly relates to a sparse depth densification method based on a multi-scale convolutional neural network.
Background
In autonomous driving, a perception system based on computer vision technology is the most fundamental component. At present, visible-light cameras are the sensors most commonly used in such perception systems; they are inexpensive and the related technology is mature. However, visible-light cameras also have significant drawbacks. First, the RGB images they capture contain only color information, so the perception system is prone to misjudgment when the texture of a target is complex. Second, visible-light cameras fail in some environments; for example, at night, when the light is insufficient, a camera can hardly work normally. Lidar is another sensor often used in autonomous-driving perception systems. It is not easily affected by illumination conditions, and the collected point cloud data is inherently three-dimensional, so a depth image can be obtained directly from it: the point cloud is mapped onto a two-dimensional plane, and the value of each pixel represents the distance from that point to the sensor. Compared with an RGB image, the distance information contained in a depth image is more helpful for tasks such as object recognition and segmentation. However, lidar is expensive, the collected point cloud is sparse, and the generated depth map is therefore also sparse, which limits the usefulness of the sensor to a certain degree.
Disclosure of Invention
The aim of the invention is, in view of the above problems, to provide a method for densifying sparse depth using a multi-scale network.
The sparse depth densification method based on the multi-scale network comprises the following steps:
constructing a multi-scale network model:
the multi-scale network model comprises L (L ≥ 2) input branches; the outputs of the L branches are added element-wise and then fed into an information fusion layer, which is followed by an upsampling layer serving as the output layer of the multi-scale network model;
among the L input branches, one branch takes the original image as input; the remaining L-1 branches take as input images obtained by downsampling the original image by different factors; the output image of the output layer of the multi-scale network model has the same size as the original image;
the input data of the L input branches comprises an RGB image and a sparse depth map; the sparse depth map of the original image is downsampled as follows: based on a preset downsampling factor K, the sparse depth map is divided into grids of pixels, each grid containing K×K original input pixels; a flag value s_i is set for each original input pixel according to its depth value: if the depth value of the current original input pixel is 0, then s_i = 0, otherwise s_i = 1, where i indexes the K×K original input pixels contained in each grid; the depth value p_new of each grid is then obtained according to the formula

$$p_{new} = \frac{\sum_{i=1}^{K\times K} s_i\, p_i}{\sum_{i=1}^{K\times K} s_i},$$

where p_i denotes the depth value of original input pixel i;
the network structure of the branch whose input is the original image is a first network structure;
the network structure of a branch whose input is a downsampled image of the original image is as follows: K/2 upsampling convolution blocks D with 16 channels are appended after the first network structure, where K denotes the downsampling factor of the original image;
the first network structure includes fourteen layers, which are respectively:
the first layer is an input layer and a pooling layer; the input layer has kernel size 7×7, 64 channels, and convolution stride 2; the pooling layer uses max pooling with kernel size 3×3 and a pooling stride of 2;
the second layer and the third layer have the same structure; each is a 64-channel R₁ residual convolution block;
the fourth layer is a 128-channel R₂ residual convolution block;
the fifth layer is a 128-channel R₁ residual convolution block;
the sixth layer is a 256-channel R₂ residual convolution block;
the seventh layer is a 256-channel R₁ residual convolution block;
the eighth layer is a 512-channel R₂ residual convolution block;
the ninth layer is a 512-channel R₁ residual convolution block;
the tenth layer is a convolution layer with kernel size 3×3, 256 channels, and convolution stride 1;
the eleventh layer is a 128-channel upsampling convolution block D; the output of the eleventh layer and the output of the seventh layer are concatenated along the channel dimension and then input into the twelfth layer;
the twelfth layer is a 64-channel upsampling convolution block D; the output of the twelfth layer and the output of the fifth layer are concatenated along the channel dimension and then input into the thirteenth layer;
the thirteenth layer is a 32-channel upsampling convolution block D; the output of the thirteenth layer and the output of the third layer are concatenated along the channel dimension and then input into the fourteenth layer;
the fourteenth layer is a 16-channel upsampling convolution block D;
the R is 1 The residual convolution block comprises two layers of convolution layers with the same structure, the convolution kernel size is 3*3, the convolution step length is 1, and the number of channels is adjustable; and will input R 1 Adding the input data of the residual volume block and the output corresponding point of the second layer to access a ReLU activation function as R 1 An output layer of the residual convolution block;
the R is 2 The residual convolution block includes first, second and third convolution layers, and an input R 2 The input data of the residual convolution block respectively enters two branches, and then the output corresponding points of the two branches are added to be connected with a ReLU activation function as R 2 An output layer of the residual convolution block; one branch is a first convolution layer and a second convolution layer which are connected in sequence, and the other branch is a third convolution layer;
the first convolution layer and the second convolution layer are identical in structure, the convolution kernel size is 3*3, the convolution step length is 2, and the number of channels can be adjusted; the third convolution layer has the convolution kernel size of 3*3, the convolution step length is 1, and the number of channels is adjustable;
the up-sampling convolution block D comprises two amplification modules and a convolution layer, wherein input data input into the up-sampling convolution block D respectively enter two branches, and output corresponding points of the two branches are added to be connected with a ReLU activation function to serve as an output layer of the up-sampling convolution block D; one branch is a first amplification module and a convolution layer which are connected in sequence, and the other branch is a second amplification module;
wherein, the convolution layer of the up-sampling convolution block D is: the convolution kernel size is 3*3, the convolution step is 1, and the number of channels is adjustable;
the amplification module of the up-sampling convolution block D comprises four parallel convolution layers, the number of channels of the four convolution layers is set to be the same, and the sizes of convolution kernels are respectively as follows: 3, 2, 3 and 2*2, the convolution step length is 1, and the input data of the input amplification module passes through the four convolution layers and then is spliced together to be used as the output of the amplification module;
the information fusion module is a convolution layer with the convolution kernel size of 3*3, the channel number of 1 and the convolution step length of 1;
and performing deep learning training on the constructed multi-scale network model, and obtaining a densification processing result of the image to be processed through the trained multi-scale network model.
In summary, owing to the adoption of the above technical scheme, the invention has the following beneficial effects: depth is estimated by combining the sparse point cloud with the image, the sparse depth guides the RGB image and the RGB image supplements the sparse depth, so the advantages of the two data forms are combined; depth estimation is further performed at multiple scales by the multi-scale network model, which improves the accuracy of depth estimation.
Drawings
FIG. 1 is a schematic down-sampling illustration of the present invention in an embodiment;
FIG. 2 is a schematic diagram of the residual convolution blocks in an embodiment, where FIG. 2-a is the type-one residual convolution block and FIG. 2-b is the type-two residual convolution block;
FIG. 3 is a schematic diagram of the upsampling convolution block in an embodiment, where FIG. 3-a shows the amplification module and FIG. 3-b shows the entire upsampling convolution block;
FIG. 4 is a diagram illustrating the multi-scale network architecture used in an exemplary embodiment;
FIG. 5 shows results of the present invention and of an existing processing method in an embodiment, where FIG. 5-a is the input RGB image, FIG. 5-b is the sparse depth map, FIG. 5-c is the depth estimation of FIG. 5-b using an existing method, and FIG. 5-d is the depth estimation result of FIG. 5-b using the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
To meet the demand of specific scenarios (such as autonomous driving) for high-quality depth images, the invention provides a method for densifying sparse depth using a multi-scale network. Existing depth estimation methods mainly use the RGB image alone to obtain dense depth directly, but a depth map estimated directly from a two-dimensional image suffers from inherent ambiguity. To solve this problem, the invention estimates depth by combining the sparse point cloud with the image: the sparse depth guides the RGB image and the RGB image supplements the sparse depth, so the advantages of the two data forms are combined; depth estimation is carried out at multiple scales, which improves its accuracy.
The invention uses a multi-scale convolutional neural network to effectively fuse RGB image data with sparse point cloud data and finally obtains a dense depth image. The sparse point cloud is mapped onto a two-dimensional plane to generate a sparse depth map, which is aligned with the RGB image and concatenated with it to form an RGBD (RGB + Depth Map) image; the RGBD image is fed into the multi-scale convolutional neural network for training and testing, and a dense depth map is finally estimated. Estimating depth from the combination of the RGB image and the sparse point cloud lets the distance information contained in the point cloud guide the conversion of the RGB image into a depth map. The multi-scale network exploits information from the original data at different resolutions: on the one hand it enlarges the receptive field, and on the other hand the input depth map at the smaller resolution is denser, so higher accuracy can be obtained.
The specific implementation of the sparse depth densification method based on the multi-scale network is as follows:
(1) Input data downsampling:
the feasible down-sampling multiples have a large relation to the size of the input data. For an input image of size M x N, a range of possible downsampling multiples is [2,min (M, N) × 2% -5 ]。
The sampling method is as follows: let K denote the selected downsampling factor. The input sparse depth map is divided into grids of pixels, each grid containing K × K original input pixels, so the input image is divided into

$$\frac{M}{K} \times \frac{N}{K}$$

grids. FIG. 1 is a schematic diagram for a downsampling factor of 2. The K × K pixels in a grid are represented as a pixel set P = {p₁, p₂, ..., p_{K×K}}.
The sparse depth map contains pixels whose depth is zero; these are referred to as invalid values. A flag value s is constructed to mark invalid values: if the depth value of a pixel is not equal to 0, the pixel is valid and s = 1; otherwise the value is invalid and s = 0. The set of flag values corresponding to the pixel set P is thus S = {s₁, s₂, ..., s_{K×K}}.
The new depth value after downsampling is:

$$p_{new} = \frac{\sum_{n=1}^{K\times K} s_n\, p_n}{\sum_{n=1}^{K\times K} s_n},$$

where p_n denotes the depth value of original pixel n and s_n denotes its flag value.
This operation is performed on every grid, producing a new depth map with lower resolution and higher density (referred to as the small-resolution depth map for short). Compared with a conventional downsampling method, the small-resolution depth map obtained in this way is denser, and its depth values are more accurate because the influence of invalid values is eliminated. The RGB image is downsampled with conventional bilinear interpolation. The result is a small-resolution image together with its sparse depth map.
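For illustration only, the following is a minimal NumPy sketch of this validity-masked downsampling; the function name and the convention that a grid with no valid pixel keeps depth 0 are assumptions, not taken from the patent.

```python
import numpy as np

def downsample_sparse_depth(depth: np.ndarray, K: int) -> np.ndarray:
    """Validity-masked KxK downsampling of a sparse depth map (H, W assumed divisible by K)."""
    H, W = depth.shape
    # Split the map into (H/K) x (W/K) grids of K x K pixels.
    grids = depth.reshape(H // K, K, W // K, K)
    valid = (grids > 0).astype(depth.dtype)        # flag values s_i
    num_valid = valid.sum(axis=(1, 3))             # sum of s_i per grid
    depth_sum = (grids * valid).sum(axis=(1, 3))   # sum of s_i * p_i per grid
    # Average over valid pixels only; grids with no valid pixel stay 0 (assumption).
    return np.where(num_valid > 0, depth_sum / np.maximum(num_valid, 1), 0.0)

# Example: a 4x4 sparse map downsampled by K = 2.
d = np.array([[0, 2, 0, 0],
              [4, 0, 0, 6],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=np.float32)
print(downsample_sparse_depth(d, 2))   # [[3. 6.] [1. 0.]]
```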
(2) Constructing the residual convolution blocks:
The residual convolution block is an important component of the multi-scale network of the present invention; it is used to extract features from the input data and comes in two types.
Type one: residual convolution block R₁. The construction process is as follows: as shown in FIG. 2-a, the first layer of the residual convolution block is a convolution layer with kernel size 3×3, n channels, and convolution stride 1. The second layer has the same structure as the first layer. The input data is then added element-wise to the output of the second layer, and finally a ReLU activation function is applied. The structure of the residual convolution block is fixed, but the number of channels of its convolution layers is variable; different residual convolution blocks are obtained by adjusting the number of channels, so a type-one residual convolution block is named an n-channel R₁. The input and output sizes of R₁ are identical, since it contains no downsampling operation.
Type two: residual convolution block R₂. The construction process is as follows: as shown in FIG. 2-b, the first layer of the residual convolution block is a convolution layer with kernel size 3×3, n channels, and convolution stride 2. The second layer is also a convolution layer, with kernel size 3×3, n channels, and convolution stride 1. The input data is passed through a separate convolution layer with kernel size 1×1, n channels, and convolution stride 2, and its output is added element-wise to the output of the second layer. Finally a ReLU activation function is applied. Named in the same way as R₁, a type-two residual convolution block is called an n-channel R₂. The input size of R₂ is twice its output size; the purpose of this operation is to enlarge the receptive field of the convolution kernels and better extract global features.
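For illustration, here is a minimal PyTorch sketch of the two residual block types described above (FIG. 2-a, FIG. 2-b); the class names, the intermediate ReLU between the two convolutions, and the separate in_channels argument of R₂ are assumptions not spelled out in the patent.

```python
import torch.nn as nn

class ResBlockR1(nn.Module):
    """Type-one residual block R1: two 3x3 stride-1 convs + identity skip; size-preserving."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))   # intermediate ReLU is an assumption
        return self.relu(out + x)                    # element-wise add, then ReLU

class ResBlockR2(nn.Module):
    """Type-two residual block R2: 3x3 stride-2 conv + 3x3 stride-1 conv,
    with a 1x1 stride-2 conv on the shortcut; halves the spatial size."""
    def __init__(self, in_channels: int, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.shortcut = nn.Conv2d(in_channels, channels, kernel_size=1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.shortcut(x))
```

For example, ResBlockR2(64, 128) would correspond to the 128-channel R₂ used as the fourth layer of the first network structure.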
(3) Constructing the upsampling convolution block:
The upsampling convolution block is another important part of the multi-scale network; its role is to enlarge its input, and each upsampling convolution block doubles the spatial size of the input. The construction process is as follows. The basic module of the upsampling convolution block is the amplification module: as shown in FIG. 3-a, it consists of four parallel convolution layers, each with n channels, whose kernel sizes are 3×3, 3×2, 2×3 and 2×2 respectively; the input passes through the four convolution layers and the outputs are spliced together, so the output is twice the size of the input. As shown in FIG. 3-b, the upsampling convolution block consists of two branches. The first layer of branch one is an n-channel amplification module followed by a ReLU activation function; its second layer is a convolution layer with kernel size 3×3 and n channels. Branch two has only one layer, an n-channel amplification module. The outputs of the two branches are added element-wise, and finally a ReLU activation function is applied. Named in the same way as R₁ and R₂, the upsampling convolution block is called an n-channel D.
(4) Constructing a multi-scale convolution network:
the multi-scale network can construct multiple scales, namely, multiple branches can be constructed, the constructed number of the branches is influenced by the size of the input image as the down-sampling multiple, and the size of the branches isFor M x N images, the upper limit of the number of branches is log 2 (min(M,N)*2 -5 ) +1. The construction method takes two branches as an example, and two branches are required to be established, wherein the input of one branch is the original resolution, the input of the other branch is the 1/K original resolution, and K is the downsampling multiple of an input image. And finally, carrying out information fusion on the two branches.
The first branch, i.e. the branch whose input is the original resolution, is constructed as follows:
the first layer is an input layer and a pooling layer, the convolution kernel size of the input layer is 7*7, the number of channels is 64, and the convolution step size is 2. The pooling layer adopts maximum pooling, the convolution kernel size is 3*3, and the pooling constant is 2. The original input size is M x N4, and the size is changed after passing through the first layer
Figure BDA0001905651140000061
That is, the size is 1/4 of the original size, and the number of channels is 64.
The second layer is a 64-channel R₁ residual convolution block, denoted R₁¹.
The third layer has the same structure as the second layer and is denoted R₁².
The fourth layer is a 128-channel R₂ residual convolution block, denoted R₂¹.
The fifth layer is a 128-channel R₁ residual convolution block, denoted R₁³.
The sixth layer is a 256-channel R₂ residual convolution block, denoted R₂².
The seventh layer is a 256-channel R₁ residual convolution block, denoted R₁⁴.
The eighth layer is a 512-channel R₂ residual convolution block, denoted R₂³.
The ninth layer is a 512-channel R₁ residual convolution block, denoted R₁⁵.
The tenth layer is a convolution layer with kernel size 3×3, 256 channels, and convolution stride 1.
The eleventh layer is a 128-channel upsampling convolution block D, denoted D₁.
The output of D₁ and the output of the seventh layer R₁⁴ are then concatenated along the channel dimension, where the output size of R₁⁴ is

$$\frac{M}{16} \times \frac{N}{16} \times 256,$$

the output size of D₁ is

$$\frac{M}{16} \times \frac{N}{16} \times 128,$$

and the size after concatenation becomes

$$\frac{M}{16} \times \frac{N}{16} \times 384.$$

The significance of this concatenation is that it recovers some of the original information lost during convolution, making the result more accurate.
The twelfth layer is a 64-channel upsampling convolution block D, denoted D₂; the output of D₂ and the output of R₁³ are concatenated along the channel dimension.
The thirteenth layer is a 32-channel upsampling convolution block D, denoted D₃; the output of D₃ and the output of R₁² are concatenated along the channel dimension.
The fourteenth layer is a 16-channel upsampling convolution block D, denoted D₄.
At this point, the network structure of the branch whose input is the original resolution is complete.
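Putting the pieces together, a condensed PyTorch sketch of this fourteen-layer branch is shown below, reusing the ResBlockR1, ResBlockR2 and UpBlockD classes sketched earlier; the class name, the ReLU in the stem, and the exact padding values are assumptions, while the channel widths and skip concatenations follow the sizes derived above.

```python
import torch
import torch.nn as nn

class BranchNet(nn.Module):
    """First network structure: ResNet-style encoder plus four upsampling blocks with skips.
    Input: B x 4 x M x N RGBD tensor; output: B x 16 x M/2 x N/2 feature map."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                      # layer 1: 7x7 conv s2 + 3x3 maxpool s2
            nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.r1_1, self.r1_2 = ResBlockR1(64), ResBlockR1(64)         # layers 2-3
        self.r2_1, self.r1_3 = ResBlockR2(64, 128), ResBlockR1(128)   # layers 4-5
        self.r2_2, self.r1_4 = ResBlockR2(128, 256), ResBlockR1(256)  # layers 6-7
        self.r2_3, self.r1_5 = ResBlockR2(256, 512), ResBlockR1(512)  # layers 8-9
        self.conv10 = nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1)  # layer 10
        self.d1 = UpBlockD(256, 128)          # layer 11
        self.d2 = UpBlockD(128 + 256, 64)     # layer 12 (input: D1 output ++ R1^4 output)
        self.d3 = UpBlockD(64 + 128, 32)      # layer 13 (input: D2 output ++ R1^3 output)
        self.d4 = UpBlockD(32 + 64, 16)       # layer 14 (input: D3 output ++ R1^2 output)

    def forward(self, x):
        s1 = self.r1_2(self.r1_1(self.stem(x)))     # M/4,  64 ch  (R1^2)
        s2 = self.r1_3(self.r2_1(s1))               # M/8,  128 ch (R1^3)
        s3 = self.r1_4(self.r2_2(s2))               # M/16, 256 ch (R1^4)
        y  = self.conv10(self.r1_5(self.r2_3(s3)))  # M/32, 256 ch
        y = torch.cat([self.d1(y), s3], dim=1)      # M/16, 128 + 256 ch
        y = torch.cat([self.d2(y), s2], dim=1)      # M/8,  64 + 128 ch
        y = torch.cat([self.d3(y), s1], dim=1)      # M/4,  32 + 64 ch
        return self.d4(y)                           # M/2,  16 ch
```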
The second branch, the branch with the input of 1/K original resolution, is constructed as follows:
the first fourteen layers have the same structure as the branch with the original resolution, and then the corresponding number of 16-channel upsampling volume blocks D are added according to the input size of the branch. For a tributary with an input of 1/K original resolution (downsampling multiple K), K/2 upsampled convolution blocks are added. Fig. 4 shows an example of a two-branch case where the second branch is input at 1/2 of the original resolution (the down-sampling multiple is 2), and the number of the up-sampling convolution blocks D to be added by the second branch is 1. The multi-resolution case is similar, if the input is 1/4 of the original resolution, two 16-channel upsampled volume blocks are added, and so on.
After the branches are constructed, the information of the two branches needs to be fused. The structure of the information fusion is as follows: the output of the first branch and the output of the second branch are added element-wise and used as the input of the information fusion module. The network structure of the information fusion module is a convolution layer with kernel size 3×3 and 1 channel; finally, the output of this layer is linearly upsampled to obtain the final result, whose size is the same as that of the original input.
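Under the same assumptions, a minimal sketch of the two-branch model with this fusion step follows, reusing BranchNet and UpBlockD from above; bilinear interpolation is assumed for the final "linear upsampling".

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleNet(nn.Module):
    """Two-branch multi-scale model: element-wise sum of branch outputs,
    3x3 single-channel fusion conv, then upsampling back to the input size."""
    def __init__(self):
        super().__init__()
        self.branch_full = BranchNet()                      # original resolution
        self.branch_half = nn.Sequential(BranchNet(),       # 1/2 resolution (K = 2)
                                         UpBlockD(16, 16))  # K/2 = 1 extra 16-channel D block
        self.fuse = nn.Conv2d(16, 1, kernel_size=3, stride=1, padding=1)

    def forward(self, x_full, x_half):
        y = self.branch_full(x_full) + self.branch_half(x_half)   # both M/2 x N/2 x 16
        y = self.fuse(y)
        # Upsample the fused single-channel map back to the original input size.
        return F.interpolate(y, size=x_full.shape[-2:], mode='bilinear', align_corners=False)
```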
Information fusion in the case of more than two branches is analogous: the outputs of all L branches are added element-wise and fed into the information fusion module.
(5) Setting of the loss function:
In the present embodiment, the loss function is the Smooth L1 loss, i.e.,

$$L = \frac{1}{N}\sum_{j=1}^{N} \operatorname{smooth}_{L1}\!\left(d_j - d_j^{g}\right),\qquad \operatorname{smooth}_{L1}(x)=\begin{cases}0.5\,x^{2}, & |x|<1,\\ |x|-0.5, & \text{otherwise},\end{cases}$$

where d_j denotes the depth value estimated by the convolutional neural network at pixel j, d_j^g denotes the corresponding standard (ground-truth) depth value, and N denotes the total number of pixels in a depth map.
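For reference, a short sketch of this loss (assuming the standard Smooth L1 form reconstructed above; PyTorch's built-in torch.nn.SmoothL1Loss with its default beta of 1 computes the same function):

```python
import torch

def smooth_l1_depth_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Smooth L1 loss between predicted and ground-truth depth maps."""
    diff = pred - target
    abs_diff = diff.abs()
    loss = torch.where(abs_diff < 1, 0.5 * diff ** 2, abs_diff - 0.5)
    return loss.mean()   # average over all N pixels

# Equivalent built-in: torch.nn.SmoothL1Loss(reduction='mean')
```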
(6) Training and testing of the model:
In the present embodiment, the training data comes from the public NYU-Depth-v2 dataset, which contains RGB images and dense depth maps of size 640 × 480. For training, 48000 RGB images and their corresponding dense depth maps are selected; for testing, 654 RGB images and their corresponding dense depth maps are selected. The input of the network is an RGB image and a sparse depth map. The dataset does not contain sparse depth maps; a sparse depth map is obtained by randomly sampling 1000 points from the dense depth map, and it is combined with the RGB image to form the RGBD input.
During training, the RGBD image is downsampled to 320 × 240 and then center-cropped to 304 × 228 (this is the original image fed to the multi-scale network model); this image is the input of the first branch. It is then downsampled by a factor of two according to the method described in step (1) to obtain a 152 × 114 RGBD image, which is the input of the second branch. Eight images are trained at a time (batch size 8), so one pass over the entire dataset takes 6000 iterations; the dataset is trained for 15 epochs, 90000 iterations in total. A varying learning rate is used: the initial learning rate is set to 0.01 and is divided by 10 every 5 epochs, giving a final learning rate of 0.0001. After training, the parameters of the model are saved.
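A skeletal training loop consistent with these settings might look as follows; the SGD optimizer with momentum, the hypothetical NYUSparseDepthDataset loader, and the model constructor are assumptions, while the batch size, epoch count, and step learning-rate schedule follow the text above.

```python
import torch
from torch.utils.data import DataLoader

# Assumed objects: MultiScaleNet and smooth_l1_depth_loss from the sketches above,
# and a hypothetical NYUSparseDepthDataset yielding (rgbd_full, rgbd_half, dense_gt).
model = MultiScaleNet().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)           # optimizer assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)   # /10 every 5 epochs
loader = DataLoader(NYUSparseDepthDataset('train'), batch_size=8, shuffle=True)  # 6000 iters/epoch

for epoch in range(15):                       # 15 epochs = 90000 iterations in total
    for rgbd_full, rgbd_half, dense_gt in loader:
        pred = model(rgbd_full.cuda(), rgbd_half.cuda())
        loss = smooth_l1_depth_loss(pred, dense_gt.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()

torch.save(model.state_dict(), 'multiscale_depth.pth')   # save the model parameters
```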
During testing, the model parameters are loaded, the test data are processed in the same way as during training and fed to the model, and the final result is output. FIG. 5 shows comparisons between the outputs of the present invention and those of an existing deep-learning method. The results of the invention are clearer overall, and the comparison of the regions in the black boxes shows that the invention recovers details better.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (3)

1. A sparse depth densification method based on a multi-scale network is characterized by comprising the following steps:
constructing a multi-scale network model:
the multi-scale network model comprises L input branches; the outputs of the L branches are added element-wise and then fed into an information fusion layer, which is followed by an upsampling layer serving as the output layer of the multi-scale network model;
among the L input branches, one branch takes the original image as input; the remaining L-1 branches take as input images obtained by downsampling the original image by different factors; the output image of the output layer of the multi-scale network model has the same size as the original image;
the input data of the L input branches comprises an RGB image and a sparse depth map; the sparse depth map of the original image is downsampled as follows: based on a preset downsampling factor K, the sparse depth map is divided into grids of pixels, each grid containing K×K original input pixels; a flag value s_i is set for each original input pixel according to its depth value: if the depth value of the current original input pixel is 0, then s_i = 0, otherwise s_i = 1, where i indexes the K×K original input pixels contained in each grid; the depth value p_new of each grid is then obtained according to the formula

$$p_{new} = \frac{\sum_{i=1}^{K\times K} s_i\, p_i}{\sum_{i=1}^{K\times K} s_i},$$

where p_i denotes the depth value of original input pixel i;
the network structure of the branch whose input is the original image is a first network structure;
the network structure of a branch whose input is a downsampled image of the original image is as follows: K/2 upsampling convolution blocks D with 16 channels are appended after the first network structure, where K denotes the downsampling factor of the original image;
the first network structure includes fourteen layers, which are respectively:
the first layer is an input layer and a pooling layer; the input layer has kernel size 7×7, 64 channels, and convolution stride 2; the pooling layer uses max pooling with kernel size 3×3 and a pooling stride of 2;
the second layer and the third layer have the same structure; each is a 64-channel R₁ residual convolution block;
the fourth layer is a 128-channel R₂ residual convolution block;
the fifth layer is a 128-channel R₁ residual convolution block;
the sixth layer is a 256-channel R₂ residual convolution block;
the seventh layer is a 256-channel R₁ residual convolution block;
the eighth layer is a 512-channel R₂ residual convolution block;
the ninth layer is a 512-channel R₁ residual convolution block;
the tenth layer is a convolution layer with kernel size 3×3, 256 channels, and convolution stride 1;
the eleventh layer is a 128-channel upsampling convolution block D; the output of the eleventh layer and the output of the seventh layer are concatenated along the channel dimension and then input into the twelfth layer;
the twelfth layer is a 64-channel upsampling convolution block D; the output of the twelfth layer and the output of the fifth layer are concatenated along the channel dimension and then input into the thirteenth layer;
the thirteenth layer is a 32-channel upsampling convolution block D; the output of the thirteenth layer and the output of the third layer are concatenated along the channel dimension and then input into the fourteenth layer;
the fourteenth layer is a 16-channel upsampling convolution block D;
the R is 1 The residual convolution block comprises two layers of convolution layers with the same structure, the convolution kernel size is 3*3, the convolution step length is 1, and the number of channels is adjustable; and will input R 1 Adding the input data of the residual volume block and the output corresponding point of the second layer to access a ReLU activation function as R 1 An output layer of the residual convolution block;
the R is 2 The residual convolution block includes first, second and third convolution layers and an input R 2 The input data of the residual convolution block respectively enters two branches, and then the output corresponding points of the two branches are added to be connected with a ReLU activation function as R 2 An output layer of the residual convolution block; one branch is a first convolution layer and a second convolution layer which are connected in sequence, and the other branch is a third convolution layer;
the first convolution layer and the second convolution layer have the same structure, the convolution kernel size is 3*3, the convolution step length is 2, and the number of channels is adjustable; the third convolution layer has the convolution kernel size of 3*3, the convolution step length is 1, and the number of channels is adjustable;
the up-sampling convolution block D comprises two amplification modules and a convolution layer, wherein input data input into the up-sampling convolution block D respectively enter two branches, and output corresponding points of the two branches are added and connected into a ReLU activation function to serve as an output layer of the up-sampling convolution block D; one branch is a first amplification module and a convolution layer which are connected in sequence, and the other branch is a second amplification module;
wherein, the convolution layer of the up-sampling convolution block D is: the convolution kernel size is 3*3, the convolution step is 1, and the number of channels is adjustable;
the amplification module of the up-sampling convolution block D comprises four parallel convolution layers, the number of channels of the four convolution layers is set to be the same, and the sizes of convolution kernels are respectively as follows: 3, 2, 3 and 2*2, the convolution step length is 1, and the input data of the input amplification module passes through the four convolution layers and then is spliced together to be used as the output of the amplification module;
the information fusion module is a convolution layer with the convolution kernel size of 3*3, the channel number of 1 and the convolution step length of 1;
and carrying out deep learning training on the constructed multi-scale network model, and obtaining a densification processing result of the image to be processed through the trained multi-scale network model.
2. The method of claim 1, wherein the RGB image of the original image is downsampled using bilinear interpolation.
3. The method of claim 1, wherein the loss function used in the deep learning training of the multi-scale network model is

$$L = \frac{1}{N}\sum_{j=1}^{N} \operatorname{smooth}_{L1}\!\left(d_j - d_j^{g}\right),\qquad \operatorname{smooth}_{L1}(x)=\begin{cases}0.5\,x^{2}, & |x|<1,\\ |x|-0.5, & \text{otherwise},\end{cases}$$

where d_j denotes the depth value of pixel j output by the multi-scale network model, i.e., the estimated value, j is the pixel index, d_j^g denotes the standard depth value of pixel j, i.e., the label value of the training sample, and N denotes the total number of pixels in one sparse depth map.
CN201811531022.5A 2018-12-14 2018-12-14 Sparse depth densification method based on multi-scale network Active CN109685842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811531022.5A CN109685842B (en) 2018-12-14 2018-12-14 Sparse depth densification method based on multi-scale network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811531022.5A CN109685842B (en) 2018-12-14 2018-12-14 Sparse depth densification method based on multi-scale network

Publications (2)

Publication Number Publication Date
CN109685842A CN109685842A (en) 2019-04-26
CN109685842B (en) 2023-03-21

Family

ID=66187804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811531022.5A Active CN109685842B (en) 2018-12-14 2018-12-14 Sparse depth densification method based on multi-scale network

Country Status (1)

Country Link
CN (1) CN109685842B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490118A (en) * 2019-08-14 2019-11-22 厦门美图之家科技有限公司 Image processing method and device
CN110796105A (en) * 2019-11-04 2020-02-14 中国矿业大学 Remote sensing image semantic segmentation method based on multi-modal data fusion
CN113034562B (en) * 2019-12-09 2023-05-12 百度在线网络技术(北京)有限公司 Method and apparatus for optimizing depth information
CN111062981B (en) * 2019-12-13 2023-05-05 腾讯科技(深圳)有限公司 Image processing method, device and storage medium
CN111079683B (en) * 2019-12-24 2023-12-12 天津大学 Remote sensing image cloud and snow detection method based on convolutional neural network
CN111199516B (en) * 2019-12-30 2023-05-05 深圳大学 Image processing method, system and storage medium based on image generation network model
CN111179331B (en) * 2019-12-31 2023-09-08 智车优行科技(上海)有限公司 Depth estimation method, depth estimation device, electronic equipment and computer readable storage medium
CN110992271B (en) * 2020-03-04 2020-07-07 腾讯科技(深圳)有限公司 Image processing method, path planning method, device, equipment and storage medium
CN111667522A (en) * 2020-06-04 2020-09-15 上海眼控科技股份有限公司 Three-dimensional laser point cloud densification method and equipment
CN112001914B (en) * 2020-08-31 2024-03-01 三星(中国)半导体有限公司 Depth image complement method and device
CN112102472B (en) * 2020-09-01 2022-04-29 北京航空航天大学 Sparse three-dimensional point cloud densification method
CN112258626A (en) * 2020-09-18 2021-01-22 山东师范大学 Three-dimensional model generation method and system for generating dense point cloud based on image cascade
CN112837262B (en) * 2020-12-04 2023-04-07 国网宁夏电力有限公司检修公司 Method, medium and system for detecting opening and closing states of disconnecting link
CN112861729B (en) * 2021-02-08 2022-07-08 浙江大学 Real-time depth completion method based on pseudo-depth map guidance
CN113256546A (en) * 2021-05-24 2021-08-13 浙江大学 Depth map completion method based on color map guidance
EP4156085A4 (en) * 2021-08-06 2023-04-26 Shenzhen Goodix Technology Co., Ltd. Depth image collection apparatus, depth image fusion method and terminal device
CN113344839B (en) * 2021-08-06 2022-01-07 深圳市汇顶科技股份有限公司 Depth image acquisition device, fusion method and terminal equipment
CN113807417B (en) * 2021-08-31 2023-05-30 中国人民解放军战略支援部队信息工程大学 Dense matching method and system based on deep learning visual field self-selection network
CN114627351B (en) * 2022-02-18 2023-05-16 电子科技大学 Fusion depth estimation method based on vision and millimeter wave radar
CN114494023B (en) * 2022-04-06 2022-07-29 电子科技大学 Video super-resolution implementation method based on motion compensation and sparse enhancement
CN116152066B (en) * 2023-02-14 2023-07-04 苏州赫芯科技有限公司 Point cloud detection method, system, equipment and medium for complete appearance of element
CN115861401B (en) * 2023-02-27 2023-06-09 之江实验室 Binocular and point cloud fusion depth recovery method, device and medium
CN115908531B (en) * 2023-03-09 2023-06-13 深圳市灵明光子科技有限公司 Vehicle-mounted ranging method and device, vehicle-mounted terminal and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108535675A (en) * 2018-04-08 2018-09-14 朱高杰 A kind of magnetic resonance multichannel method for reconstructing being in harmony certainly based on deep learning and data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9519972B2 (en) * 2013-03-13 2016-12-13 Kip Peli P1 Lp Systems and methods for synthesizing images from image data captured by an array camera using restricted depth of field depth maps in which depth estimation precision varies
US9412172B2 (en) * 2013-05-06 2016-08-09 Disney Enterprises, Inc. Sparse light field representation
CN106408015A (en) * 2016-09-13 2017-02-15 电子科技大学成都研究院 Road fork identification and depth estimation method based on convolutional neural network
CN107767413B (en) * 2017-09-20 2020-02-18 华南理工大学 Image depth estimation method based on convolutional neural network
CN107944459A (en) * 2017-12-09 2018-04-20 天津大学 A kind of RGB D object identification methods

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108535675A (en) * 2018-04-08 2018-09-14 朱高杰 A kind of magnetic resonance multichannel method for reconstructing being in harmony certainly based on deep learning and data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of deep network model compression; Lei Jie et al.; Journal of Software (Issue 02); pp. 31-46 *

Also Published As

Publication number Publication date
CN109685842A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN109685842B (en) Sparse depth densification method based on multi-scale network
CN111563923B (en) Method for obtaining dense depth map and related device
JP6745328B2 (en) Method and apparatus for recovering point cloud data
CN110674829B (en) Three-dimensional target detection method based on graph convolution attention network
CN109598754B (en) Binocular depth estimation method based on depth convolution network
WO2018119889A1 (en) Three-dimensional scene positioning method and device
CN108801274B (en) Landmark map generation method integrating binocular vision and differential satellite positioning
CN111127538B (en) Multi-view image three-dimensional reconstruction method based on convolution cyclic coding-decoding structure
CN111160214A (en) 3D target detection method based on data fusion
CN114254696A (en) Visible light, infrared and radar fusion target detection method based on deep learning
DE112017003815T5 (en) IMAGE PROCESSING DEVICE AND IMAGE PROCESSING METHOD
AU2021103300A4 (en) Unsupervised Monocular Depth Estimation Method Based On Multi- Scale Unification
CN111325782A (en) Unsupervised monocular view depth estimation method based on multi-scale unification
WO2021056516A1 (en) Method and device for target detection, and movable platform
CN115035235A (en) Three-dimensional reconstruction method and device
CN112907573A (en) Depth completion method based on 3D convolution
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115457354A (en) Fusion method, 3D target detection method, vehicle-mounted device and storage medium
CN116778288A (en) Multi-mode fusion target detection system and method
CN113592015B (en) Method and device for positioning and training feature matching network
CN117132737B (en) Three-dimensional building model construction method, system and equipment
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN109657556B (en) Method and system for classifying road and surrounding ground objects thereof
CN116630528A (en) Static scene reconstruction method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant