CN112288788A - Monocular image depth estimation method - Google Patents

Monocular image depth estimation method

Info

Publication number
CN112288788A
Authority
CN
China
Prior art keywords
image
depth
training
monocular
training image
Prior art date
Legal status
Granted
Application number
CN202011084248.2A
Other languages
Chinese (zh)
Other versions
CN112288788B (en)
Inventor
霍智勇
乔璐
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202011084248.2A
Publication of CN112288788A
Application granted
Publication of CN112288788B
Legal status: Active
Anticipated expiration

Classifications

    • G06T7/50 Depth or shape recovery (G06T7/00 Image analysis)
    • G06N3/045 Combinations of networks (G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08 Learning methods (G06N3/02 Neural networks)
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T7/11 Region-based segmentation (G06T7/10 Segmentation; Edge detection)
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; Image merging (G06T2207/20212 Image combination)
    • Y02T10/40 Engine management systems (Y02T10/10 Internal combustion engine [ICE] based vehicles)

Abstract

A monocular image depth estimation method, comprising: acquiring a training image; inputting the acquired training image into a pre-constructed depth prediction network for training to obtain a corresponding predicted depth map; and performing a joint loss calculation on the obtained predicted depth map and the corresponding ground-truth (GT) depth map using a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss, to obtain a corresponding monocular depth estimation map. With this scheme, the accuracy of monocular image depth estimation can be improved.

Description

Monocular image depth estimation method
Technical Field
The invention relates to the technical field of image processing, in particular to a monocular image depth estimation method.
Background
The acquisition of three-dimensional depth information from two-dimensional images is an important problem in computer vision and an important component of understanding the geometric relationships of a scene. Image depth information has important applications in simultaneous localization and mapping (SLAM), navigation, object detection, semantic segmentation and other fields.
Unlike conventional methods based on multi-view geometry or binocular stereo matching, monocular image depth estimation performs depth estimation using only the image of a single viewpoint. Because most real-world application scenarios provide only single-viewpoint data, monocular depth estimation is closer to actual application requirements.
However, existing monocular image depth estimation methods suffer from low accuracy.
Disclosure of Invention
The invention aims to provide a monocular image depth estimation method to improve the accuracy of monocular image depth estimation.
In order to solve the above technical problem, an embodiment of the present invention provides a monocular image depth estimation method, where the method includes:
acquiring a training image;
inputting the obtained training image into a depth prediction network which is constructed in advance for training to obtain a corresponding prediction depth map;
and performing a joint loss calculation on the obtained predicted depth map and the corresponding GT depth map using a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss, to obtain a corresponding monocular depth estimation map.
Optionally, before inputting the acquired training image into a pre-constructed depth prediction network for training, the method further includes:
augmenting the training image to obtain a first training image;
adjusting the augmented training image to a preset resolution to obtain a second training image;
and performing normalization processing on the second training image to obtain a preprocessed training image.
Optionally, the augmenting the training image includes: performing at least one of scaling, rotation and random horizontal flipping on the training image.
Optionally, the inputting the acquired training image into a pre-constructed depth prediction network for training includes:
inputting the preprocessed training image into a preset ResNet50 network to obtain a plurality of specific-layer feature images with successively reduced resolution;
performing reverse-order traversal on the plurality of specific-layer feature images to obtain the current specific-layer feature image being traversed;
merging the current specific-layer feature image with its corresponding fused feature image to generate an image with the same resolution as the preceding specific-layer feature image in the sequence, until the traversal of the plurality of specific-layer feature images is completed; wherein the corresponding fused feature image is obtained by performing residual convolution on the feature image of the next specific layer and fusing the result with a bilinearly upsampled image of the feature image of that specific layer.
Optionally, the joint loss function is:
L = L_rank + αL_ms-ssim + βL_grad
wherein L represents the joint loss function, L_rank represents the ranking loss based on random sampling, L_ms-ssim represents the multi-scale structural similarity loss function, L_grad represents the multi-scale invariant gradient matching loss function, α represents the balance factor of the multi-scale structural similarity loss function, and β represents the balance factor of the multi-scale invariant gradient matching loss function.
Optionally, the number of the specific layer feature images is 4.
Optionally, the GT depth map is obtained as the horizontal component of the optical flow of a binocular image computed using FlowNet2.0.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
according to the scheme, a training image is obtained; inputting the obtained training image into a depth prediction network which is constructed in advance for training to obtain a corresponding prediction depth map; and performing joint loss calculation on the obtained prediction depth map and the corresponding GT depth map by adopting a joint loss function of sequencing loss, multi-scale structure similarity loss and multi-scale invariant gradient matching loss to obtain a corresponding monocular depth estimation map. According to the scheme, during training, the predicted depth map and the GT depth map are subjected to combined loss function calculation of sequencing loss, multi-scale structure similarity loss and multi-scale invariant gradient matching loss, the problems of geometric inconsistency and edge blurring of the predicted depth map caused by only adopting sequencing loss based on random sampling point pairs are solved, and therefore the accuracy of the predicted depth map can be improved.
Drawings
FIG. 1 is a flowchart illustrating a monocular image depth estimation method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of inputting an acquired training image into a pre-constructed depth prediction network for training in the embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Because a single-viewpoint image provides relatively little information and depth estimation from it depends heavily on scene semantics, a method trained on one dataset often performs worse when applied to another dataset; improving the generalization and accuracy of monocular depth estimation therefore remains challenging.
For depth estimation datasets, depth sensors (e.g., Kinect, laser scanners) have previously been used to obtain accurate depth measurements, but these are limited to rigid objects or sparse reconstructions; such datasets lack diversity and do not generalize to outdoor images. Another source of RGB-D images is synthetic data, which is noise-free and provides accurate depth measurements and clear depth discontinuities, but the domain gap between synthetic and real data requires domain adaptation for practical use. To explore the diversity of the visual world, interest in in-the-wild scene images has grown. For example, Chen et al. proposed the in-the-wild dataset DIW, containing manually annotated relative-depth point pairs; Li et al. proposed MegaDepth, a dataset built from Internet photos of hundreds of well-known landmarks; and Ke Xian et al. proposed the ReDWeb dataset, which obtains dense relative depth maps from binocular images collected from the web.
Conventional monocular depth estimation is based on geometric methods: camera pose and a sparse point cloud are estimated with SLAM or Structure from Motion (SfM), and dense depth values are then obtained with Multi-View Stereo (MVS). Such methods can produce highly accurate reconstructions, but they require strict usage conditions and cannot handle non-rigid regions in the images well. In recent years, deep learning has developed rapidly, and monocular image depth estimation methods based on deep learning have attracted attention. Eigen et al. first proposed using a convolutional neural network for monocular depth estimation. The basic idea is a two-scale network consisting of a global coarse-scale network and a local fine-scale network: the former produces a low-resolution coarse depth map, and the latter refines its output to obtain the final fine depth map, but the predicted depth map still has low accuracy and poor detail. Eigen et al. later improved on this basis by adding a third-scale network to output higher-resolution predictions. However, such multi-scale network methods rely on supervised training with datasets containing real depth, which are difficult to produce and small in number, so the applicable scenes and generalization ability of the algorithms are limited by the datasets. Chen et al. proposed depth prediction using relative depth, i.e., training on an in-the-wild dataset containing manually labeled relative-depth point pairs to predict the relative depth of the input image. Ke Xian et al. randomly sample point pairs from the GT depth map obtained from binocular images and the predicted depth map generated by a deep convolutional network, and apply the pairwise ranking loss proposed by Chen et al.
In summary, deep-learning methods can automatically extract image features with convolutional neural networks to describe depth information and obtain depth predictions. However, such methods still suffer from poor generalization and insufficient depth map accuracy.
According to the technical scheme, during training the predicted depth map and the GT depth map are evaluated with a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss. This overcomes the geometric inconsistency and edge blurring of the predicted depth map caused by using only a ranking loss based on randomly sampled point pairs, so the accuracy of the predicted depth map can be improved.
Fig. 1 is a flowchart illustrating a monocular image depth estimation method according to an embodiment of the present invention. Referring to fig. 1, a monocular image depth estimation method in the embodiment of the present invention may specifically include:
step S101: a training image is acquired.
Step S102: inputting the acquired training image into a pre-constructed depth prediction network for training to obtain a corresponding predicted depth map.
Step S103: performing a joint loss calculation on the obtained predicted depth map and the corresponding GT depth map using a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss, to obtain a corresponding monocular depth estimation map.
The above steps S101 to S103 are described in detail below with reference to the accompanying drawings.
Step S101 is executed to acquire a training image.
In specific implementation, in order to improve the generalization capability of depth prediction, the training dataset should be as diverse as possible. The NYU v2 dataset is only applicable to indoor scenes, the KITTI dataset only includes road-related scenes, and the Make3D dataset is small, containing only 959 outdoor scene images. Based on this, the invention selects the ReDWeb dataset as the training set, which contains 3600 pairs of RGB images and GT depth maps. The optical flow is obtained from each binocular stereo image pair with a FlowNet2.0 network, and its horizontal component is used as the supervision depth map; however, since some images contain large texture-less regions such as sky, a pre-trained RefineNet is used to segment the sky region, and the segmented sky region is set to the maximum pixel value in the GT depth map. Meanwhile, a validation set containing 1410 images from the DIW dataset proposed by Chen et al. is used as the validation dataset herein.
In the embodiment of the invention, after the training image is acquired, the method further comprises preprocessing the training image sample data. Specifically, this may include: first, augmenting the training image sample data using data augmentation, including scaling, rotation and random horizontal flipping, to obtain a corresponding first training image; second, adjusting the augmented first training image to a preset resolution, such as 384 × 384, to obtain a second training image, which is then fed into the depth prediction network whose encoder is ResNet; and finally, performing normalization processing on the sample image data obtained through the preceding preprocessing operations. The normalization is computed as follows:
y_i[channel] = (x_i[channel] − mean[channel]) / std[channel]
wherein x_i[channel] represents the three-channel pixel values of the training image obtained after preprocessing, y_i[channel] represents the pixel values of the training image after normalization, mean[channel] represents the mean of the pixel values of the training image, and std[channel] represents the standard deviation of the pixel values of the training image.
In an embodiment of the present invention, mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
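As an illustration only, the preprocessing described above might be implemented with torchvision transforms as sketched below; the concrete augmentation ranges and the use of torchvision are assumptions, and only the 384 × 384 resolution and the mean/std values come from the text.

    import torchvision.transforms as T

    # Assumed preprocessing pipeline for the RGB training images:
    # random scaling / rotation / horizontal flip (data augmentation),
    # resize to 384 x 384, then per-channel normalization with the
    # mean/std values quoted above.
    preprocess = T.Compose([
        T.RandomHorizontalFlip(p=0.5),
        T.RandomRotation(degrees=5),                 # assumed rotation range
        T.RandomResizedCrop(384, scale=(0.8, 1.0)),  # assumed scaling range
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])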
Step S102 is executed: the acquired training image is input into the pre-constructed depth prediction network for training, and a corresponding predicted depth map is obtained.
In specific implementation, when the acquired training image is input into a pre-constructed depth prediction network for training:
first, the preprocessed training images are input into a preset ResNet50 network, and a plurality of specific layer feature images with successively reduced resolution are obtained.
Referring to fig. 2, for example, preprocessed training image data of size 384 × 384 is used as the input of the depth prediction network. The ResNet50 network serving as the encoder is divided into 4 different building blocks according to the resolution of the output feature maps; the feature map sizes (in W × H × C form) output by the blocks are 96 × 96 × 256, 48 × 48 × 512, 24 × 24 × 1024 and 12 × 12 × 2048, respectively, and the final feature map is 1/32 the size of the input image.
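For illustration, the four block outputs could be collected from a standard torchvision ResNet-50 as sketched below; the use of torchvision and its layer names is an assumption, since the patent only specifies ResNet50 as the encoder.

    import torch.nn as nn
    import torchvision.models as models

    class ResNet50Encoder(nn.Module):
        """Returns the four block outputs at 1/4, 1/8, 1/16 and 1/32 resolution."""
        def __init__(self):
            super().__init__()
            r = models.resnet50(pretrained=True)   # pre-trained weights, as in the text
            self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
            self.layer1, self.layer2 = r.layer1, r.layer2
            self.layer3, self.layer4 = r.layer3, r.layer4

        def forward(self, x):            # x: B x 3 x 384 x 384
            x = self.stem(x)
            f1 = self.layer1(x)          # B x 256  x 96 x 96
            f2 = self.layer2(f1)         # B x 512  x 48 x 48
            f3 = self.layer3(f2)         # B x 1024 x 24 x 24
            f4 = self.layer4(f3)         # B x 2048 x 12 x 12
            return [f1, f2, f3, f4]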
Secondly, reverse-order traversal is performed on the plurality of specific-layer feature images to obtain the current specific-layer feature image being traversed; the current specific-layer feature image is merged with its corresponding fused feature image to generate an image with the same resolution as the preceding specific-layer feature image in the sequence, until all specific-layer feature images have been traversed. The corresponding fused feature image is obtained by performing residual convolution on the feature image of the next specific layer and fusing the result with a bilinearly upsampled image of the feature image of that specific layer.
Directly applying simple upsampling or deconvolution to the feature maps obtained by ResNet produces only a coarse predicted depth map. Two methods can be used to obtain better prediction results: dilated (atrous) convolution and multi-scale feature fusion. The former occupies too much memory and easily produces checkerboard artifacts, whereas the latter saves memory and produces high-quality predictions; multiple multi-scale feature fusion modules are therefore used in the network decoder.
Referring to fig. 2, the forward propagation of the feature fusion part in the decoder includes: first, the last specific-layer feature map generated by ResNet50 is upsampled, expanding from 12 × 12 × 2048 to 24 × 24 × 2048; then, a residual convolution block is applied to the current specific-layer feature map, and the result is merged with the fused feature map generated by the preceding feature fusion module; finally, the merged result is passed through a residual convolution block and upsampling to generate a feature map with the same resolution as the next input block, until all four specific-layer feature images have been traversed.
In order to generate the final depth prediction result, the output of the three feature fusion modules is input into an adaptive output module, which comprises two 3 × 3 convolution layers and a bilinear upsampling layer, yielding the predicted depth map with a final size of 384 × 384.
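A simplified sketch of one possible feature-fusion decoder step and adaptive output module follows; the channel widths, the 1 × 1 projection and the internal layout of the residual convolution blocks are assumptions, since the patent only describes the overall data flow.

    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualConvBlock(nn.Module):
        """Assumed residual convolution unit used inside each fusion module."""
        def __init__(self, ch):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

        def forward(self, x):
            return x + self.conv2(F.relu(self.conv1(F.relu(x))))

    class FeatureFusion(nn.Module):
        """One decoder step: project the encoder feature, refine it, add the
        fusion result coming from the deeper level, refine again, upsample 2x."""
        def __init__(self, in_ch, mid_ch=256):
            super().__init__()
            self.project = nn.Conv2d(in_ch, mid_ch, 1)   # assumed 1x1 channel projection
            self.res_in = ResidualConvBlock(mid_ch)
            self.res_out = ResidualConvBlock(mid_ch)

        def forward(self, enc_feat, deeper_fused=None):
            x = self.res_in(self.project(enc_feat))
            if deeper_fused is not None:
                x = x + deeper_fused                     # merge with deeper fusion result
            x = self.res_out(x)
            return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

    class AdaptiveOutput(nn.Module):
        """Two 3x3 convolution layers followed by bilinear upsampling to 384 x 384."""
        def __init__(self, ch=256):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch // 2, 3, padding=1)
            self.conv2 = nn.Conv2d(ch // 2, 1, 3, padding=1)

        def forward(self, x, out_size=(384, 384)):
            x = self.conv2(F.relu(self.conv1(x)))
            return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)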
In order to obtain the optimal solution of the depth prediction network model, in the embodiment of the invention a joint loss calculation is performed on the obtained predicted depth map and the corresponding GT depth map using a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss, to obtain a corresponding monocular depth estimation map.
Specifically, before training, the network weights need to be initialized: the encoder part is initialized with the weights of a pre-trained ResNet50 network, and the decoder part is initialized with random numbers drawn from a normal distribution with mean 0 and variance 0.01. After the preprocessed monocular image is input into the depth prediction network in step S102, a reasonable loss function needs to be proposed as the constraint for optimizing the network parameters. For the predicted depth image, both global accuracy and local accuracy must be evaluated; using only a ranking loss based on random sampling cannot produce depth predictions that are geometrically consistent and have accurate edges. A joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss is therefore adopted to improve the accuracy of the predicted depth map. The loss function used for training is as follows:
L = L_rank + αL_ms-ssim + βL_grad    (2)
wherein L represents the joint loss function, L_rank represents the ranking loss based on random sampling, L_ms-ssim represents the multi-scale structural similarity loss function, L_grad represents the multi-scale invariant gradient matching loss function, α represents the balance factor of the multi-scale structural similarity loss function, and β represents the balance factor of the multi-scale invariant gradient matching loss function.
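For illustration, the three terms might be combined as in the minimal sketch below; the values of the balance factors α and β are not disclosed in the patent and are placeholders here, and ranking_loss, ms_ssim_loss and gradient_matching_loss refer to the term sketches given further below.

    def joint_loss(pred, gt, alpha=1.0, beta=1.0):
        """L = L_rank + alpha * L_ms-ssim + beta * L_grad (alpha, beta assumed)."""
        return (ranking_loss(pred, gt)
                + alpha * ms_ssim_loss(pred, gt)
                + beta * gradient_matching_loss(pred, gt))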
The ranking loss function is calculated as follows:
L_rank = Σ_{i=1}^{N} φ(p_i,0, p_i,1)

and:

φ(p_i,0, p_i,1) = log(1 + exp(−l_i·(p_i,0 − p_i,1))), if l_i ≠ 0
φ(p_i,0, p_i,1) = (p_i,0 − p_i,1)², if l_i = 0

l_i = +1, if p*_i,0 / p*_i,1 ≥ 1 + τ;  l_i = −1, if p*_i,0 / p*_i,1 ≤ 1/(1 + τ);  l_i = 0, otherwise
where N represents the number of randomly sampled point pairs, φ(p_i,0, p_i,1) represents the pairwise ranking loss on the predicted depth map, p_i,0 and p_i,1 represent the depth values of a point pair on the predicted depth image, l represents the ordering label of the corresponding point pair on the GT depth map, p*_i,0 and p*_i,1 represent the depth values of the point pair on the GT depth image, and τ represents the threshold, set to 0.02.
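A minimal PyTorch-style sketch of this random-sampling ranking term follows; the number of sampled pairs and the uniform sampling strategy are assumptions, and pred and gt are assumed to be B × 1 × H × W tensors.

    import torch

    def ranking_loss(pred, gt, num_pairs=5000, tau=0.02):
        """Pairwise ranking loss over randomly sampled point pairs (sketch)."""
        b, _, h, w = pred.shape
        idx0 = torch.randint(0, h * w, (b, num_pairs), device=pred.device)
        idx1 = torch.randint(0, h * w, (b, num_pairs), device=pred.device)
        p, g = pred.view(b, -1), gt.view(b, -1)
        p0, p1 = p.gather(1, idx0), p.gather(1, idx1)
        g0, g1 = g.gather(1, idx0), g.gather(1, idx1)

        # ordering label from the GT depth ratio, with threshold tau
        ratio = g0 / (g1 + 1e-8)
        label = torch.zeros_like(ratio)
        label[ratio >= 1 + tau] = 1.0
        label[ratio <= 1.0 / (1 + tau)] = -1.0

        loss_ordered = torch.log1p(torch.exp(-label * (p0 - p1)))  # l != 0
        loss_equal = (p0 - p1) ** 2                                # l == 0
        return torch.where(label != 0, loss_ordered, loss_equal).mean()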
The multi-scale structural similarity loss L_ms-ssim is calculated as follows:
L_ms-ssim = 1 − [l_M(p, p*)]^α_M · Π_{j=1}^{M} [c_j(p, p*)]^β_j · [s_j(p, p*)]^γ_j
wherein c_j(p, p*) and s_j(p, p*) respectively represent the comparison of the predicted depth with the GT depth in terms of contrast and structure at scale j; l_M(p, p*) represents the luminance comparison, performed only at the highest scale M; α_M, β_j and γ_j adjust the relative importance of the different components; to simplify parameter selection, α_j = β_j = γ_j is set at scale j.
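One way to realize this term is to reuse an off-the-shelf MS-SSIM implementation; the sketch below assumes the third-party pytorch_msssim package and depth maps normalized to [0, 1], neither of which is specified by the patent.

    from pytorch_msssim import ms_ssim  # third-party package, assumed available

    def ms_ssim_loss(pred, gt):
        """Multi-scale structural similarity loss: 1 - MS-SSIM(pred, gt)."""
        return 1.0 - ms_ssim(pred, gt, data_range=1.0)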
The multi-scale invariant gradient matching loss L_grad is calculated as follows:

L_grad = (1/M) Σ_s Σ_i ( |∇_x R_i^s| + |∇_y R_i^s| )
where M represents the number of pixels of the GT depth map, R_i^s represents the difference between the pixel values of the predicted depth map and those of the GT depth map at scale s, and s indexes the scales. In an embodiment of the present invention, the number of scales is 4.
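A minimal sketch of the multi-scale gradient matching term follows; building the scales by repeated 2× average pooling and averaging over pixels (rather than an explicit 1/M sum) are assumptions.

    import torch.nn.functional as F

    def gradient_matching_loss(pred, gt, num_scales=4):
        """Multi-scale gradient matching on the residual R = pred - gt (sketch)."""
        loss = 0.0
        p, g = pred, gt
        for _ in range(num_scales):
            r = p - g
            grad_x = (r[:, :, :, 1:] - r[:, :, :, :-1]).abs()
            grad_y = (r[:, :, 1:, :] - r[:, :, :-1, :]).abs()
            loss = loss + grad_x.mean() + grad_y.mean()
            p = F.avg_pool2d(p, 2)   # next (coarser) scale
            g = F.avg_pool2d(g, 2)
        return loss / num_scales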
By adopting the technical scheme of the embodiment of the invention, a training image is acquired, the acquired training image is input into a pre-constructed depth prediction network for training to obtain a corresponding predicted depth map, and a joint loss calculation is performed on the obtained predicted depth map and the corresponding GT depth map using a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss, to obtain a corresponding monocular depth estimation map. During training, evaluating the predicted depth map against the GT depth map with this joint loss function overcomes the geometric inconsistency and edge blurring that arise when only a ranking loss based on randomly sampled point pairs is used, so the accuracy of the predicted depth map can be improved.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic disks, optical disks, and the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A monocular image depth estimation method, comprising:
acquiring a training image;
inputting the obtained training image into a depth prediction network which is constructed in advance for training to obtain a corresponding prediction depth map;
and performing a joint loss calculation on the obtained predicted depth map and the corresponding GT depth map using a joint loss function combining a ranking loss, a multi-scale structural similarity loss and a multi-scale invariant gradient matching loss, to obtain a corresponding monocular depth estimation map.
2. The monocular image depth estimation method of claim 1, wherein before inputting the acquired training image into a pre-constructed depth prediction network for training, further comprising:
augmenting the training image to obtain a first training image;
adjusting the augmented training image to a preset resolution to obtain a second training image;
and performing normalization processing on the second training image to obtain a preprocessed training image.
3. The monocular image depth estimation method of claim 2, wherein the augmenting the training image comprises: performing at least one of scaling, rotation and random horizontal flipping on the training image.
4. The monocular image depth estimation method according to claim 2 or 3, wherein the inputting the acquired training image into a pre-constructed depth prediction network for training comprises:
inputting the preprocessed training image into a preset ResNet50 network to obtain a plurality of specific-layer feature images with successively reduced resolution;
performing reverse-order traversal on the plurality of specific-layer feature images to obtain the current specific-layer feature image being traversed;
merging the current specific-layer feature image with its corresponding fused feature image to generate an image with the same resolution as the preceding specific-layer feature image in the sequence, until the traversal of the plurality of specific-layer feature images is completed; wherein the corresponding fused feature image is obtained by performing residual convolution on the feature image of the next specific layer and fusing the result with a bilinearly upsampled image of the feature image of that specific layer.
5. The monocular image depth estimation method of claim 4, wherein the joint loss function is:
L = L_rank + αL_ms-ssim + βL_grad
wherein L represents the joint loss function, L_rank represents the ranking loss based on random sampling, L_ms-ssim represents the multi-scale structural similarity loss function, L_grad represents the multi-scale invariant gradient matching loss function, α represents the balance factor of the multi-scale structural similarity loss function, and β represents the balance factor of the multi-scale invariant gradient matching loss function.
6. The monocular image depth estimation method of claim 4, wherein the number of the specific layer feature images is 4.
7. The monocular image depth estimation method of claim 1, wherein the GT depth map is obtained as the horizontal component of the optical flow of a binocular image computed using FlowNet2.0.
CN202011084248.2A 2020-10-12 2020-10-12 Monocular image depth estimation method Active CN112288788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011084248.2A CN112288788B (en) 2020-10-12 2020-10-12 Monocular image depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011084248.2A CN112288788B (en) 2020-10-12 2020-10-12 Monocular image depth estimation method

Publications (2)

Publication Number Publication Date
CN112288788A true CN112288788A (en) 2021-01-29
CN112288788B CN112288788B (en) 2023-04-28

Family

ID=74497002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011084248.2A Active CN112288788B (en) 2020-10-12 2020-10-12 Monocular image depth estimation method

Country Status (1)

Country Link
CN (1) CN112288788B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757984A (en) * 2022-04-26 2022-07-15 北京拙河科技有限公司 Scene depth estimation method and device of light field camera
CN116152323A (en) * 2023-04-18 2023-05-23 荣耀终端有限公司 Depth estimation method, monocular depth estimation model generation method and electronic equipment
CN117036439A (en) * 2023-10-09 2023-11-10 广州市大湾区虚拟现实研究院 Single image depth estimation method and system based on multi-scale residual error network
WO2023245321A1 (en) * 2022-06-20 2023-12-28 北京小米移动软件有限公司 Image depth prediction method and apparatus, device, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578436A (en) * 2017-08-02 2018-01-12 南京邮电大学 A kind of monocular image depth estimation method based on full convolutional neural networks FCN
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578436A (en) * 2017-08-02 2018-01-12 南京邮电大学 A kind of monocular image depth estimation method based on full convolutional neural networks FCN
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757984A (en) * 2022-04-26 2022-07-15 北京拙河科技有限公司 Scene depth estimation method and device of light field camera
WO2023245321A1 (en) * 2022-06-20 2023-12-28 北京小米移动软件有限公司 Image depth prediction method and apparatus, device, and storage medium
CN116152323A (en) * 2023-04-18 2023-05-23 荣耀终端有限公司 Depth estimation method, monocular depth estimation model generation method and electronic equipment
CN116152323B (en) * 2023-04-18 2023-09-08 荣耀终端有限公司 Depth estimation method, monocular depth estimation model generation method and electronic equipment
CN117036439A (en) * 2023-10-09 2023-11-10 广州市大湾区虚拟现实研究院 Single image depth estimation method and system based on multi-scale residual error network

Also Published As

Publication number Publication date
CN112288788B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN107578436B (en) Monocular image depth estimation method based on full convolution neural network FCN
Zhang et al. Multi-scale single image dehazing using perceptual pyramid deep network
CN112288788B (en) Monocular image depth estimation method
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN112001960B (en) Monocular image depth estimation method based on multi-scale residual error pyramid attention network model
Laffont et al. Rich intrinsic image decomposition of outdoor scenes from multiple views
CN111462206B (en) Monocular structure light depth imaging method based on convolutional neural network
Zhang et al. Personal photograph enhancement using internet photo collections
CN110910437B (en) Depth prediction method for complex indoor scene
WO2018053952A1 (en) Video image depth extraction method based on scene sample library
CN111626308B (en) Real-time optical flow estimation method based on lightweight convolutional neural network
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN112862736B (en) Real-time three-dimensional reconstruction and optimization method based on points
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN115222889A (en) 3D reconstruction method and device based on multi-view image and related equipment
CN115423978A (en) Image laser data fusion method based on deep learning and used for building reconstruction
CN116452752A (en) Intestinal wall reconstruction method combining monocular dense SLAM and residual error network
CN116310095A (en) Multi-view three-dimensional reconstruction method based on deep learning
CN112686830B (en) Super-resolution method of single depth map based on image decomposition
CN113421210A (en) Surface point cloud reconstruction method based on binocular stereo vision
CN111260712B (en) Depth estimation method and device based on refocusing polar line graph neighborhood distribution
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
Wang et al. Decomposed guided dynamic filters for efficient rgb-guided depth completion
CN112365400A (en) Rapid super-resolution reconstruction method for light field angle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant