CN112085702A - Monocular depth estimation method based on sparse depth of key region - Google Patents
Monocular depth estimation method based on sparse depth of key region
- Publication number
- CN112085702A CN112085702A CN202010777954.9A CN202010777954A CN112085702A CN 112085702 A CN112085702 A CN 112085702A CN 202010777954 A CN202010777954 A CN 202010777954A CN 112085702 A CN112085702 A CN 112085702A
- Authority
- CN
- China
- Prior art keywords
- depth
- layer
- sparse
- network
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000000034 method Methods 0.000 title claims abstract description 31
- 238000012549 training Methods 0.000 claims abstract description 40
- 238000012360 testing method Methods 0.000 claims abstract description 23
- 238000005070 sampling Methods 0.000 claims abstract description 15
- 238000013528 artificial neural network Methods 0.000 claims abstract description 11
- 230000000694 effects Effects 0.000 claims abstract description 7
- 238000000605 extraction Methods 0.000 claims abstract description 7
- 238000010606 normalization Methods 0.000 claims description 14
- 239000011159 matrix material Substances 0.000 claims description 13
- 238000011176 pooling Methods 0.000 claims description 8
- 238000007781 pre-processing Methods 0.000 claims description 7
- 238000011156 evaluation Methods 0.000 claims description 6
- 238000001914 filtration Methods 0.000 claims description 4
- 238000007796 conventional method Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 12
- 238000013135 deep learning Methods 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000011161 development Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000003708 edge detection Methods 0.000 description 1
- 238000005265 energy consumption Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000001575 pathological effect Effects 0.000 description 1
- 238000000513 principal component analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/513—Sparse representations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a monocular depth estimation method based on the sparse depth of key regions. The RGB images of the training set, together with the corresponding sparse depths, are input into an encoder for feature extraction and then up-sampled to obtain a predicted depth map of the same size as the input image; the loss function of the network is computed, back propagation is performed, and the connection weights are optimized with the selected optimizer and its parameters. After multiple rounds of training, the final network model is obtained and then evaluated on the test set. The method has the advantages that the sampling points are more reasonable and targeted, selecting the points that are key to the neural network's depth estimate; the quantitative results of depth estimation are improved, the predicted depth is more accurate with smaller error than that of conventional methods, and the generated depth map is clearer.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a monocular depth estimation method based on a sparse depth map.
Background
Depth estimation is widely applied in engineering practice, for example in autonomous driving and augmented reality; it is a very important direction in computer vision and a very popular research topic in recent years. Besides monocular camera ranging, several other methods are common: ranging with a lidar, whose accuracy is unmatched by the alternatives; ranging with a structured-light sensor such as the Kinect; and binocular (stereo) camera ranging. Lidar ranging has extremely high accuracy, but its selling price of thousands of dollars is often prohibitive, and laser sensors are easily affected by environmental factors such as haze. Structured-light sensors such as the Kinect have a short detection range, relatively high power consumption, and are sensitive to illumination intensity. Binocular camera ranging requires careful manual calibration.
Given the various drawbacks of the above methods and the rapid development of deep learning in recent years, depth estimation with a monocular camera combined with deep-learning-based methods has attracted wide attention from researchers and developed quickly. Many deep-learning-based monocular depth estimation methods have been proposed in the literature; by the kind of supervision information used, they can be roughly divided into three types: supervised, weakly supervised, and unsupervised. The main input data are still RGB images, and although this line of work has made progress in recent years, depth estimation from RGB images alone is inherently an ill-posed problem, so the overall accuracy and reliability remain unsatisfactory.
Therefore, sparse depth is adopted as a supplement to the RGB image, providing reference points from which the neural network predicts depth; the sparse depth can be acquired in various ways, such as SLAM or an inexpensive lidar. This greatly improves the accuracy of monocular depth estimation.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: because of the high cost of lidar and the low accuracy and reliability of monocular depth estimation from RGB images alone, existing deep-learning-based monocular depth estimation methods cannot provide an accurate depth map.
In view of the above situation, the present invention provides a monocular depth estimation method based on a sparse depth map: the sparse depth of key regions is extracted as auxiliary input for monocular depth estimation, and a deep neural network extracts features to generate a dense depth map. The main design points are: 1) splitting and augmenting the data set; 2) designing a network with an encoder-decoder (auto-encoder) structure that performs feature extraction and up-sampling on the input data; 3) extracting key regions of the RGB image to obtain the sparse depth of those regions; 4) testing the trained model on a test set.
A monocular depth estimation method based on sparse depth of a key region comprises the following steps:
Step 1, preprocessing the data set: cropping, rotating and changing the brightness of the training set, and cropping the test set.
Step 2, designing a network model structure;
The network model is divided into two parts: an encoder and an up-sampling network. The encoder adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer. The up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer;
and 3, training the network, inputting the RGB image in the training set and the corresponding sparse depth into an encoder for feature extraction, and then performing up-sampling to obtain a predicted depth map pred with the same size as the input image.
And 4, calculating a loss function of the network, performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameters. And training for multiple rounds to obtain a final network model.
And 5, testing the network model. And inputting the data image of the test set and the corresponding sparse depth into the trained model to obtain a predicted depth map pred, and calculating each evaluation index.
The above steps are specifically described below:
the step 1 is as follows:
The preprocessing of the training set comprises: scaling the training data, randomly flipping it horizontally, randomly rotating it, center-cropping it, applying color data enhancement, and finally normalizing it.
The preprocessing of the test set comprises: scaling the test data, center-cropping it, and finally normalizing it.
The network structure of step 2 is specifically as follows:
The encoder of the network adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer. The up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer. The UpProj module comprises an upper branch and a lower branch: the input is first up-sampled by an unpooling layer, then passes through a convolution layer and a normalization layer and is activated by the ReLU function; it then passes through the upper and lower branches respectively, the branch outputs are added, and the result is activated by a final ReLU. The lower branch is, in order, a convolution layer with kernel size 5 × 5 and a normalization layer, followed, after ReLU activation, by a convolution layer with kernel size 3 × 3 and a normalization layer. The upper branch is a convolution layer with kernel size 5 × 5 and a normalization layer.
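The UpProj module described above can be sketched in PyTorch as follows. This is a hedged sketch, not the authoritative implementation: the unpooling layer is approximated by 2× nearest-neighbour upsampling, the kernel size of the pre-branch convolution is not stated in the text and is taken as 3 × 3 here, and the channel counts are free parameters.

```python
import torch
import torch.nn as nn

class UpProj(nn.Module):
    """Sketch of the UpProj block: unpool, conv+BN+ReLU, then two
    branches whose outputs are summed and passed through a final ReLU."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 'unpool' approximated by 2x nearest-neighbour upsampling (assumption)
        self.unpool = nn.Upsample(scale_factor=2, mode="nearest")
        # conv + normalization + ReLU applied before the two branches
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # lower branch: 5x5 conv + BN, ReLU, then 3x3 conv + BN
        self.lower = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 5, padding=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # upper branch: 5x5 conv + BN
        self.upper = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 5, padding=2),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.stem(self.unpool(x))                    # upsample, then conv+BN+ReLU
        return self.relu(self.lower(x) + self.upper(x))  # add branches, final ReLU
```

Chaining four such modules halves the channel count and doubles the spatial size at each stage, consistent with the 1024 × 8 × 10 → 64 × 128 × 160 progression described in the embodiment.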
The sparse depth in step 3 is obtained specifically as follows:
and performing Gaussian filtering on the RGB image in the input training set, wherein the Gaussian filtering adopts a convolution layer with a convolution kernel size of 3 x 3. And extracting the image edge by using a canny operator to obtain a mask. And generating a random number matrix s _ mask which has the same size as the mask and the value of 0 to 1 through numpy, and setting a threshold prob to enable the value of the random number matrix s _ mask which is smaller than the threshold prob to be 0 and the rest to be 1. And carrying out bitwise AND operation on the random number matrix s _ mask and the extracted mask to obtain a final sparse depth mask _ depth _ mask. And generating a full 0 matrix sparse _ depth with the same size as the depth map through numpy, and taking the depth value of the position with the sparse depth mask sparse _ depth _ mask value of 1 in the corresponding depth map in the data set as the value of the corresponding position in the full 0 matrix sparse _ depth to obtain the final sparse depth of the key area.
Further, the parameters of the Canny operator are set as follows: the ratio of the low threshold to the high threshold is 1:3, and the kernel size is 3 × 3.
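As a concrete illustration, the extraction pipeline above can be sketched in numpy alone. Two assumptions are made to keep the sketch dependency-free and are not from the patent: a simple gradient-magnitude threshold stands in for the Canny operator, and a pixel is sampled when its random value falls below prob (consistent with prob being the fraction of edge pixels to keep).

```python
import numpy as np

def key_region_sparse_depth(gray, depth, prob, seed=0):
    """Sketch of key-region sparse-depth extraction.

    `gray` is a 2-D grayscale image, `depth` the aligned dense depth map.
    A gradient-magnitude threshold is used as a stand-in for the Canny
    edge detector of the patent (an assumption for this sketch).
    """
    # Edge mask (proxy for Canny): finite differences plus a threshold.
    gx = np.abs(np.diff(gray.astype(float), axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray.astype(float), axis=0, prepend=gray[:1, :]))
    mask = ((gx + gy) > 30).astype(np.uint8)

    # Random matrix in [0, 1); an entry below `prob` marks a sampled pixel.
    rng = np.random.default_rng(seed)
    s_mask = (rng.random(mask.shape) < prob).astype(np.uint8)

    # Bitwise AND: sample only on edge pixels.
    sparse_depth_mask = s_mask & mask

    # All-zero matrix; copy ground-truth depth at the sampled positions.
    sparse_depth = np.zeros_like(depth)
    sparse_depth[sparse_depth_mask == 1] = depth[sparse_depth_mask == 1]
    return sparse_depth
```

With prob = 1.0 every edge pixel is kept; smaller values of prob thin the edge set down to the desired number of samples.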
The loss function L and related parameters of step 4 are as follows:
the loss function L contains three terms:
L = l_depth + l_grad + l_ssim
where l_depth is the depth error, used to bring the prediction closer to the training-set Ground Truth; it is defined as:

l_depth = (1/n) Σ_{p=1}^{n} |ŷ_p − y_p|

where n is the number of pixels of the image input into the neural network, ŷ_p is one pixel of the depth predicted by the neural network, and y_p is a pixel of the training-set Ground Truth.
l_grad is the gradient error, used to make the edges of the generated depth map sharper and clearer; it is defined as:

l_grad = (1/n) Σ_{p=1}^{n} ( |∇_x(ŷ_p − y_p)| + |∇_y(ŷ_p − y_p)| )

where ∇_x is the differential in the x direction and ∇_y is the differential in the y direction.
l_ssim is the structural-similarity error, adopted so that the generated depth map has a better visual effect:

l_ssim = 1 − SSIM(ŷ, y),  SSIM(x, y) = (2 μ_x μ_y + c1)(2 σ_xy + c2) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))

where μ denotes the mean, σ_x² the variance of x, σ_y² the variance of y, σ_xy the covariance of x and y, and c1, c2 are two constants.
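Numerically, the three loss terms can be sketched as follows. This is a hedged sketch: the depth term is taken in L1 form, and the SSIM term is computed globally over the whole image with the constants from the embodiment, whereas practical implementations often use a local window.

```python
import numpy as np

def depth_loss(pred, gt, c1=1.0, c2=9.0):
    """Sketch of the three-term loss L = l_depth + l_grad + l_ssim."""
    n = pred.size
    # l_depth: mean absolute error between prediction and ground truth.
    l_depth = np.abs(pred - gt).sum() / n

    # l_grad: finite-difference gradients of the residual in x and y.
    res = pred - gt
    l_grad = (np.abs(np.diff(res, axis=1)).mean()
              + np.abs(np.diff(res, axis=0)).mean())

    # l_ssim: 1 - SSIM, computed globally (an assumption for this sketch).
    mu_x, mu_y = pred.mean(), gt.mean()
    var_x, var_y = pred.var(), gt.var()
    cov = ((pred - mu_x) * (gt - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
    l_ssim = 1.0 - ssim
    return l_depth + l_grad + l_ssim
```

For identical prediction and ground truth the loss is zero (SSIM equals 1), and any deviation increases one or more of the three terms.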
The evaluation indices in step 5 include the root-mean-square error RMSE, the mean relative error REL, and the threshold accuracy δ_i:

RMSE = sqrt( (1/n) Σ_p (ŷ_p − y_p)² ),  REL = (1/n) Σ_p |ŷ_p − y_p| / y_p,
δ_i = Card( { ŷ_p : max(ŷ_p / y_p, y_p / ŷ_p) < 1.25^i } ) / Card( { y_p } )

where Card(x) is the number of elements in the set x, used in the present invention to count pixels.
The invention has the following beneficial effects:
According to the invention, the sparse depth at object edges is extracted as auxiliary input, so the network obtains depth information at key positions and can make better predictions. Meanwhile, adding the gradient loss and the structural-similarity loss to the loss function further improves the quality of the predicted image: the generated image is sharper with clearer edges.
1. The sampling points are more reasonable and targeted, selecting the points that are key to the neural network's depth estimation.
2. The quantitative results of depth estimation are improved: compared with conventional methods, the predicted depth is more accurate and the error smaller.
3. The generated depth map is clearer.
Drawings
FIG. 1 is an algorithm flow diagram;
FIG. 2 is a flow chart of edge sparse depth extraction;
fig. 3 is a network configuration diagram.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in fig. 1, the present invention comprises the steps of:
step 1, preprocessing a data set, taking NYUv2 data set as an example, the size of an original image (the original image includes an RGB image and a depth map) is 480 × 640, and performing the following processing:
Step 1.1, the short side is scaled to 240 and the long side correspondingly to 320, reducing the resources consumed by subsequent operations.
Step 1.2, random horizontal flipping and random rotation are applied, with a rotation angle of ±5°, enhancing the diversity of the data.
Step 1.3, center cropping is performed, cropping the image to 228 × 304 and converting it to the tensor data type.
Step 1.4, principal component analysis of the color channels is performed, with eigenvalues
[0.2175, 0.0188, 0.0045]
and the corresponding eigenvectors.
Step 1.5, color jitter is applied: contrast, brightness and saturation each vary within ±0.4.
Step 1.6, data normalization is performed, with mean
[0.485, 0.456, 0.406]
and standard deviation
[0.229, 0.224, 0.225].
The training set undergoes all 6 processing steps above; the test set undergoes only steps 1.1, 1.3 and 1.6.
Step 2. As shown in fig. 3, the network backbone comprises an encoder and a decoder that performs up-sampling. The encoder is ResNet-50, which consists of 6 parts: a 7 × 7 convolution layer; four convolution modules, Block1 through Block4, with 9, 12, 18 and 9 convolution layers respectively; and, in place of the removed final average-pooling and fully connected layers, a convolution layer with kernel size 1 × 1 and a normalization layer. After the input enters the encoder, the first 5 parts produce a 2048 × 8 × 10 feature vector, which the 1 × 1 convolution of the 6th part reduces to a 1024 × 8 × 10 feature vector.
Then comes the up-sampling part, which also comprises 6 parts: the first 4 parts are repeated up-sampling layers, i.e. UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer. The UpProj module comprises an upper branch and a lower branch: the input is first up-sampled by an unpooling layer, then passes through a convolution layer and a normalization layer and is activated by the ReLU function; it then passes through the two branches respectively, whose outputs are added and activated. The lower branch is, in order, a convolution layer with kernel size 5 × 5 and a normalization layer, followed, after ReLU activation, by a convolution layer with kernel size 3 × 3 and a normalization layer. The upper branch is a convolution layer with kernel size 5 × 5 and a normalization layer.
And 3, training the network, inputting the RGB image in the training set and the corresponding sparse depth into an encoder for feature extraction, and then performing up-sampling to obtain a predicted depth map pred with the same size as the input image.
As shown in fig. 2, the method for extracting edge sparse depth includes:
Taking the NYUv2 dataset as an example, the picture processed in step 1 has size 228 × 304. The RGB image in the dataset is Gaussian-filtered with kernel size 3 × 3, and the image edges are extracted with the Canny operator to obtain a mask. A random matrix s_mask of size 228 × 304, with values between 0 and 1, is generated, and a threshold prob is set so that entries of s_mask smaller than prob become 1 and the rest become 0:

prob = num_samples / Card({mask ≠ 0})
The meaning of the formula is that the threshold is set to the ratio of the number of pixels to be sampled to the number of non-zero elements in the edge-detection result; in this embodiment num_samples = 200. A bitwise AND of s_mask and the extracted mask gives the final sparse depth mask sparse_depth_mask. An all-zero matrix sparse_depth of size 228 × 304 is generated, and at every position where sparse_depth_mask is 1 the depth value of the corresponding depth map in the data set is copied into sparse_depth, finally giving the key-region sparse depth sparse_depth. The image and sparse_depth are fused into a 4-channel matrix to obtain the final network input, of size 4 × 228 × 304 per sample.
The parameters of the Canny operator are set as follows: the ratio of the low threshold to the high threshold is 1:3, and the kernel size is 3 × 3.
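The threshold computation and the 4-channel fusion just described can be sketched as follows (the function names are illustrative, not from the patent):

```python
import numpy as np

def sampling_threshold(mask, num_samples=200):
    """prob = num_samples / Card({mask != 0}): the threshold is the ratio
    of the number of pixels to sample to the number of edge pixels."""
    nonzero = np.count_nonzero(mask)
    return num_samples / nonzero if nonzero else 0.0

def fuse_input(image, sparse_depth):
    """Stack a 3 x 228 x 304 RGB image with a 228 x 304 sparse depth map
    into the 4 x 228 x 304 network input described above."""
    return np.concatenate([image, sparse_depth[None]], axis=0)
```

On average, num_samples edge pixels then survive the random mask, regardless of how many edge pixels the Canny step produced.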
The RGB images of the training set, together with the corresponding sparse depths, are input into the encoder for feature extraction. Each time the encoder's output feature vector passes through an up-sampling layer, the number of channels is halved and the height and width are doubled; after the four up-sampling layers the feature vector changes from 1024 × 8 × 10 to 64 × 128 × 160, and after the 3 × 3 convolution layer and the bilinear interpolation layer it becomes 1 × 228 × 304, the same size as the depth map processed in step 1.
Step 4. The loss function L is constructed, the error of each forward pass is computed, and the weights of the neural network are updated by the back-propagation algorithm. Taking the NYUv2 data set as an example, it contains 50688 training samples; with the batch size set to 32, each epoch performs 1584 iterations, and 20 epochs are trained in total. At each iteration the optimizer searches for the weights minimizing the loss function, yielding the final weights when training ends; by comparing the models obtained after each epoch, the best-performing model can be saved.
The specific loss function is:
L = l_depth + l_grad + l_ssim
where l_depth is the depth error, used to bring the prediction closer to the Ground Truth; it is defined as:

l_depth = (1/n) Σ_{p=1}^{n} |ŷ_p − y_p|

where n is the number of pixels of the image input into the neural network (after the processing of step 1, n = 69312 in this embodiment), ŷ_p is one pixel of the depth predicted by the neural network, and y_p is one pixel of the Ground Truth in the data set.
l_grad is the gradient error, used to make the edges of the generated depth map clearer and sharper; it is defined as:

l_grad = (1/n) Σ_{p=1}^{n} ( |∇_x(ŷ_p − y_p)| + |∇_y(ŷ_p − y_p)| )

l_ssim is the structural-similarity error, adopted so that the generated depth map has a better visual effect:

l_ssim = 1 − SSIM(ŷ, y),  SSIM(x, y) = (2 μ_x μ_y + c1)(2 σ_xy + c2) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))
where μ denotes the mean, σ_x² the variance of x, σ_y² the variance of y, σ_xy the covariance of x and y, and c1, c2 equal 1 and 9 respectively.
The optimizer is stochastic gradient descent (SGD) with a learning rate of 0.01, decayed to 10% of its value every 5 epochs; the momentum is 0.9 and the weight decay is 0.0004.
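The stated schedule amounts to a step decay of the learning rate, which can be written as a small function:

```python
def learning_rate(epoch, base_lr=0.01, decay_every=5, gamma=0.1):
    """Step learning-rate schedule from the embodiment: start at 0.01
    and multiply by 0.1 every 5 epochs ('decayed to 10% every 5 rounds')."""
    return base_lr * gamma ** (epoch // decay_every)
```

Over the 20 training epochs of the embodiment this yields learning rates 0.01, 0.001, 1e-4 and 1e-5 in successive 5-epoch blocks.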
Step 5. The model saved in step 4 is loaded and tested on the test set. For each sample of the test set, the trained model produces a predicted depth map, which is compared with the Ground Truth of the test set, and each evaluation index is computed.
The training and testing environment in this embodiment is:
the system comprises the following steps: ubuntu 16.04
CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz × 4
Memory: 128GB
GPU:RTX2080Ti*4
The evaluation indices in this embodiment include the root-mean-square error RMSE, the mean relative error REL, and the threshold accuracy δ_1:

RMSE = sqrt( (1/n) Σ_p (ŷ_p − y_p)² ),  REL = (1/n) Σ_p |ŷ_p − y_p| / y_p,
δ_1 = Card( { ŷ_p : max(ŷ_p / y_p, y_p / ŷ_p) < 1.25 } ) / Card( { y_p } )

where Card(x) is the number of elements in the set x, here the number of pixels.
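Under the standard definitions of these indices (an assumption consistent with the RMSE, REL and δ1 figures reported below), they can be computed as:

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depth."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def rel(pred, gt):
    """Mean relative error, normalized by the ground-truth depth."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta(pred, gt, thr=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold;
    Card(.) from the text appears here as count_nonzero / size."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.count_nonzero(ratio < thr) / ratio.size)
```

thr = 1.25 gives δ_1; 1.25² and 1.25³ give the δ_2 and δ_3 variants commonly reported alongside it.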
The test results are:
RMSE = 0.221, REL = 0.044, δ_1 = 0.972.
Claims (7)
1. a monocular depth estimation method based on sparse depth of a key region is characterized by comprising the following steps:
step 1, preprocessing a data set, cutting, rotating and changing brightness of a training set, and cutting a test set;
step 2, designing a network model structure;
the network model is divided into two parts: an encoder and an up-sampling network; the encoder adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer; the up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer;
step 3, training a network, inputting the RGB image in the training set and the corresponding sparse depth into an encoder for feature extraction, and then performing up-sampling to obtain a prediction depth map pred with the same size as the input image;
step 4, calculating a loss function of the network, performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameters; training for multiple rounds to obtain a final network model;
step 5, testing the network model; and inputting the data image of the test set and the corresponding sparse depth into the trained model to obtain a predicted depth map pred, and calculating each evaluation index.
2. The method for monocular depth estimation based on sparse depth of key regions according to claim 1, wherein step 1 specifically comprises:
the preprocessing of the training set comprises: scaling the training data, randomly flipping it horizontally, randomly rotating it, center-cropping it, applying color data enhancement, and finally normalizing it;
the preprocessing of the test set comprises: scaling the test data, center-cropping it, and finally normalizing it.
3. The method for monocular depth estimation based on sparse depth of key regions as claimed in claim 2, wherein the network structure of step 2 is specifically as follows:
the encoder of the network adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer; the up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer; the UpProj module comprises an upper branch and a lower branch: the input is first up-sampled by an unpooling layer, then passes through a convolution layer and a normalization layer and is activated by the ReLU function; it then passes through the upper and lower branches respectively, the branch outputs are added, and the result is activated by a final ReLU; the lower branch is, in order, a convolution layer with kernel size 5 × 5 and a normalization layer, followed, after ReLU activation, by a convolution layer with kernel size 3 × 3 and a normalization layer; the upper branch is a convolution layer with kernel size 5 × 5 and a normalization layer.
4. The method for monocular depth estimation based on sparse depth of a key region according to claim 1, wherein the sparse depth of step 3 is obtained specifically as follows:
Gaussian filtering is applied to the input RGB images of the training set, implemented as a convolution with kernel size 3 × 3; the image edges are extracted with the Canny operator to obtain a mask; a random matrix s_mask of the same size as the mask, with values between 0 and 1, is generated with numpy, and a threshold prob is set so that entries of s_mask smaller than prob become 1 and the rest become 0; a bitwise AND of s_mask and the extracted mask gives the final sparse depth mask sparse_depth_mask; an all-zero matrix sparse_depth of the same size as the depth map is generated with numpy, and at every position where sparse_depth_mask is 1 the depth value of the corresponding depth map in the data set is copied into sparse_depth, giving the final sparse depth of the key region.
5. The method of claim 3, wherein the loss function L and related parameters in step 4 are as follows:
the loss function L contains three terms:
L = l_depth + l_grad + l_ssim
where l_depth is the depth error, used to bring the prediction closer to the training-set Ground Truth; it is defined as:

l_depth = (1/n) Σ_{p=1}^{n} |ŷ_p − y_p|

where n is the number of pixels of the image input into the neural network, ŷ_p is one pixel of the depth predicted by the neural network, and y_p is a pixel of the training-set Ground Truth;
l_grad is the gradient error, used to make the edges of the generated depth map sharper and clearer; it is defined as:

l_grad = (1/n) Σ_{p=1}^{n} ( |∇_x(ŷ_p − y_p)| + |∇_y(ŷ_p − y_p)| )

where ∇_x is the differential in the x direction and ∇_y is the differential in the y direction;
l_ssim is the structural-similarity error, adopted so that the generated depth map has a better visual effect:

l_ssim = 1 − SSIM(ŷ, y),  SSIM(x, y) = (2 μ_x μ_y + c1)(2 σ_xy + c2) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2)).
7. The method according to claim 4, wherein the parameters of the Canny operator are set as follows: the ratio of the low threshold to the high threshold is 1:3, and the kernel size is 3 × 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010777954.9A CN112085702A (en) | 2020-08-05 | 2020-08-05 | Monocular depth estimation method based on sparse depth of key region |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112085702A true CN112085702A (en) | 2020-12-15 |
Family
ID=73736029
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010777954.9A Withdrawn CN112085702A (en) | 2020-08-05 | 2020-08-05 | Monocular depth estimation method based on sparse depth of key region |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112085702A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114627351A (en) * | 2022-02-18 | 2022-06-14 | 电子科技大学 | Fusion depth estimation method based on vision and millimeter wave radar |
CN114627351B (en) * | 2022-02-18 | 2023-05-16 | 电子科技大学 | Fusion depth estimation method based on vision and millimeter wave radar |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN108230329B (en) | Semantic segmentation method based on multi-scale convolution neural network | |
CN109544555B (en) | Tiny crack segmentation method based on generation type countermeasure network | |
CN112184577B (en) | Single image defogging method based on multiscale self-attention generation countermeasure network | |
CN111612008B (en) | Image segmentation method based on convolution network | |
CN110648334A (en) | Multi-feature cyclic convolution saliency target detection method based on attention mechanism | |
CN111625608B (en) | Method and system for generating electronic map according to remote sensing image based on GAN model | |
CN110796009A (en) | Method and system for detecting marine vessel based on multi-scale convolution neural network model | |
CN110070517B (en) | Blurred image synthesis method based on degradation imaging mechanism and generation countermeasure mechanism | |
CN111242026B (en) | Remote sensing image target detection method based on spatial hierarchy perception module and metric learning | |
CN111652273A (en) | Deep learning-based RGB-D image classification method | |
CN115908772A (en) | Target detection method and system based on Transformer and fusion attention mechanism | |
CN115861756A (en) | Earth background small target identification method based on cascade combination network | |
CN115330703A (en) | Remote sensing image cloud and cloud shadow detection method based on context information fusion | |
CN116883650A (en) | Image-level weak supervision semantic segmentation method based on attention and local stitching | |
Peng et al. | Incorporating generic and specific prior knowledge in a multiscale phase field model for road extraction from VHR images | |
CN115170978A (en) | Vehicle target detection method and device, electronic equipment and storage medium | |
CN114155165A (en) | Image defogging method based on semi-supervision | |
Babu et al. | An efficient image dahazing using Googlenet based convolution neural networks | |
CN112132867B (en) | Remote sensing image change detection method and device | |
Shit et al. | An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection | |
CN116503677B (en) | Wetland classification information extraction method, system, electronic equipment and storage medium | |
CN112085702A (en) | Monocular depth estimation method based on sparse depth of key region | |
Di et al. | FDNet: An end-to-end fusion decomposition network for infrared and visible images | |
CN116740362A (en) | Attention-based lightweight asymmetric scene semantic segmentation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201215 ||