CN111539922B - Monocular depth estimation and surface normal vector estimation method based on multitask network

Monocular depth estimation and surface normal vector estimation method based on multitask network

Info

Publication number
CN111539922B
CN111539922B CN202010303011.2A
Authority
CN
China
Prior art keywords
features
feature
correlation
scale
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010303011.2A
Other languages
Chinese (zh)
Other versions
CN111539922A (en)
Inventor
洪思宇
郭裕兰
符智恒
黄小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202010303011.2A
Publication of CN111539922A
Application granted
Publication of CN111539922B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular depth estimation and surface normal vector estimation method based on a multitask network, which comprises the following steps: collecting multi-scale information using a high-resolution network as the backbone network; outputting features at different resolutions through the high-resolution network and upsampling each feature independently to obtain feature maps at the original resolution; concatenating the obtained feature maps into a multi-scale surface feature and generating a multi-scale fusion feature; splitting the multi-scale fusion feature into 2 branch features and feeding them into a cross-correlation attention interaction module to obtain a cross-correlation matrix of learned correlations; passing each branch feature through consecutive 1x1 convolution layers and a softmax operation to obtain two cross-correlation attention maps, and using the portions of the attention maps that benefit interaction to obtain a new fusion feature; and repeating step S5 to obtain task-specific feature information, finally yielding the monocular depth estimation and surface normal vector estimation results.

Description

Monocular depth estimation and surface normal vector estimation method based on multitask network
Technical Field
The invention relates to the field of computer software, in particular to a monocular depth estimation and surface normal vector estimation method based on a multitask network.
Background
Scene depth information plays a crucial role in many research topics such as three-dimensional reconstruction, obstacle detection, and visual navigation. In 2018, Zhenyu Zhang et al. proposed TRL, a monocular depth estimation and semantic segmentation method based on a multitask network. Depth features and semantic features extracted from RGB images are weighted and concatenated; through this interaction, new depth and semantic features are obtained and used for the subsequent semantic segmentation and monocular depth estimation.
The TRL network performs interactive fusion of the multitask features in its decoder. In this process, the depth feature is merely concatenated with the weighted semantic feature, and likewise the semantic feature with the weighted depth feature. Such simple concatenation and fusion lacks theoretical guidance, and the feature maps obtained from it cannot fully exploit the feature information for interaction.
PAPNet is likewise a monocular depth estimation, semantic segmentation and surface normal vector estimation method based on a multitask network. Unlike the network Ldid, it does not let the features interact directly in the interaction process; instead it derives an affinity matrix from the features and performs a weighted summation over the affinity matrices of the individual tasks. Its performance is much higher than that of Ldid.
PAPNet also performs inter-task fusion of the multitask features in its decoder. Each branch outputs its task feature together with a corresponding affinity matrix. For the depth estimation task, for example, the affinity matrix of the depth feature is added pixel by pixel to the weighted affinity matrices of the semantic feature and the surface normal vector feature to obtain a new affinity matrix; the depth feature is then multiplied by this affinity matrix and fused into a new depth feature for the subsequent monocular depth estimation. The disadvantage of this method is that an affinity matrix must be obtained before any interaction can occur: the interaction is indirect rather than performed on the features themselves, so the feature information cannot be fully utilized.
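Although the description above does not give PAPNet's exact formulation, the affinity-style interaction it refers to can be illustrated with a minimal PyTorch sketch; the shapes, the weights w_sem and w_norm, and the softmax normalization below are illustrative assumptions rather than PAPNet's actual implementation:

```python
# Schematic sketch of affinity-matrix interaction (illustrative, not PAPNet's code).
import torch

def affinity(feat):
    # feat: (B, C, H, W) -> pairwise affinity between the H*W positions, (B, HW, HW)
    f = feat.flatten(2)                                   # (B, C, HW)
    return torch.softmax(f.transpose(1, 2) @ f, dim=-1)

def fuse_depth(depth_feat, sem_feat, norm_feat, w_sem=0.5, w_norm=0.5):
    b, c, h, w = depth_feat.shape
    # pixel-by-pixel weighted sum of the per-task affinity matrices ...
    a = affinity(depth_feat) + w_sem * affinity(sem_feat) + w_norm * affinity(norm_feat)
    # ... then the depth feature is multiplied by the combined affinity matrix
    return (depth_feat.flatten(2) @ a).view(b, c, h, w)
```

The indirect nature of the scheme is visible in the sketch: the task features only meet through the HW x HW affinity matrices, which is precisely the limitation the invention addresses.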
Disclosure of Invention
The invention aims to overcome the feature-interaction shortcomings of TRL and PAPNet by constructing a module that directly utilizes and screens feature information for interaction. Compared with TRL, the method uses cross-correlation as theoretical guidance for feature fusion; compared with PAPNet, it can perform feature interaction directly and quickly.
In order to achieve the above object, the invention adopts the following technical solution:
A monocular depth estimation and surface normal vector estimation method based on a multitask network comprises the following steps:
S1, collecting multi-scale information using a high-resolution network as the backbone network;
S2, outputting features at different resolutions through the high-resolution network, and upsampling each feature independently to obtain feature maps at the original resolution;
S3, concatenating the obtained feature maps into a multi-scale surface feature and generating a multi-scale fusion feature;
S4, splitting the multi-scale fusion feature into 2 branch features, and feeding them into a cross-correlation attention interaction module to obtain a cross-correlation matrix of learned correlations;
S5, passing each branch feature through consecutive 1x1 convolution layers, obtaining two cross-correlation attention maps via a softmax operation, and using the portions of the attention maps that benefit interaction to obtain new fusion features;
and S6, repeating step S5 to obtain task-specific feature information, finally yielding the monocular depth estimation and surface normal vector estimation results.
Preferably, the high-resolution network outputs 4 features at different resolutions, namely F1, F2, F3 and F4.
Preferably, the multi-scale surface feature is denoted Fn.
Preferably, the cross-correlation attention map is a probability map with weights between 0 and 1.
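As an illustration of steps S1 to S3, the following minimal PyTorch sketch upsamples the four backbone outputs F1 to F4 to a common resolution, concatenates them into the multi-scale surface feature Fn, and fuses them with a convolution; the channel counts and the 1x1 fusion convolution are assumptions for illustration, not the patented configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    # Sketch of steps S1-S3; the channel counts of F1-F4 are assumed.
    def __init__(self, in_channels=(32, 64, 128, 256), fused_channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels), fused_channels, kernel_size=1)

    def forward(self, feats):
        # feats: [F1, F2, F3, F4] from the high-resolution backbone,
        # with F1 carrying the highest (original) resolution
        h, w = feats[0].shape[-2:]
        up = [F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
              for f in feats]
        fn = torch.cat(up, dim=1)   # multi-scale surface feature Fn
        return self.fuse(fn)        # multi-scale fusion feature
```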
The beneficial effect of the invention is that it constructs a module that directly utilizes and screens feature information for interaction. Compared with TRL, the method uses cross-correlation as theoretical guidance for feature fusion; compared with PAPNet, it can perform feature interaction directly and quickly.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic flow chart of the cross-correlation attention mechanism interaction module in FIG. 1.
Detailed Description
The present invention will be further described below with reference to the accompanying drawings. It should be noted that the following examples illustrate detailed embodiments and specific operations based on the technical solutions of the present invention, but the protection scope of the present invention is not limited to these examples.
As shown in fig. 1 and fig. 2, the present invention is a method for monocular depth estimation and surface normal vector estimation based on a multitask network, the method includes the following steps:
s1, adopting a high-resolution network as multi-scale information of a backbone network set;
s2, outputting characteristics with different resolutions through a high-resolution network, and independently up-sampling the characteristics to obtain a characteristic diagram which is the same as the original resolution;
s3, connecting the obtained feature maps in series to obtain a multi-scale surface feature, and generating a multi-scale fusion feature;
s4, dividing the multi-scale fusion features into 2 branch features, inputting the branch features into a cross-correlation attention mechanism interaction module, and obtaining a cross-correlation matrix of learning correlation;
s5, inputting the 1x1 continuous convolution layer of each branch feature, obtaining two cross-correlation attention diagrams through softmax operation, and obtaining new fusion features by utilizing the parts which are beneficial to interaction on the attention diagrams;
and S6, repeating the step S5 to obtain the characteristic information of the specific task, and finally obtaining the monocular depth estimation and surface normal vector estimation results.
Preferably, the high-resolution network outputs 4 features at different resolutions, namely F1, F2, F3 and F4.
Preferably, the multi-scale surface feature is denoted Fn.
Preferably, the cross-correlation attention map is a probability map with weights between 0 and 1.
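The exact operations inside the cross-correlation attention mechanism interaction module of FIG. 2 are not spelled out here, so the following PyTorch sketch of steps S4 and S5 is modeled on standard cross-attention and should be read as an assumption: consecutive 1x1 convolutions project each branch, their product forms the cross-correlation matrix, softmax turns it into two attention maps with weights between 0 and 1, and each branch is refreshed with the portions of the other branch that its attention map selects:

```python
import torch
import torch.nn as nn

class CrossCorrelationAttention(nn.Module):
    # Sketch of the cross-correlation attention interaction module (steps S4-S5);
    # the projection depth and residual fusion are assumptions, not the patented design.
    def __init__(self, channels=128):
        super().__init__()
        # consecutive 1x1 convolution layers for each branch
        self.proj_a = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                                    nn.Conv2d(channels, channels, 1))
        self.proj_b = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                                    nn.Conv2d(channels, channels, 1))

    def forward(self, feat_a, feat_b):
        b, c, h, w = feat_a.shape
        qa = self.proj_a(feat_a).flatten(2)                    # (B, C, HW)
        qb = self.proj_b(feat_b).flatten(2)                    # (B, C, HW)
        corr = qa.transpose(1, 2) @ qb                         # cross-correlation matrix (B, HW, HW)
        attn_a = torch.softmax(corr, dim=-1)                   # attention map: branch A over branch B
        attn_b = torch.softmax(corr.transpose(1, 2), dim=-1)   # attention map: branch B over branch A
        # each branch keeps its own feature and absorbs the useful portion of the other
        new_a = feat_a + (qb @ attn_a.transpose(1, 2)).view(b, c, h, w)
        new_b = feat_b + (qa @ attn_b.transpose(1, 2)).view(b, c, h, w)
        return new_a, new_b
```

Under step S6 such a module would be applied repeatedly, with one branch ultimately supervised for depth and the other for surface normal vectors.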
Example 1
To validate the technical solution of the present invention, CPNet was evaluated on the NYUv2 indoor dataset, which contains 120,000 RGB images and depth maps. Surface normal vector maps were computed from the depth maps, and the method of the invention was evaluated by splitting the official dataset into 12,000 images for training and 654 images for validation. Furthermore, the unified evaluation criteria were used to obtain the metrics of the inventive method. CPNet was implemented in PyTorch and trained from scratch on an RTX 2080Ti.
The depth estimation results on the NYUv2 test set are given in the following table:
[Table presented as an image in the original publication]
On the primary evaluation metric, CPNet achieves a root mean square error (RMSE) of 0.431, improving on the state-of-the-art methods (e.g., PAPNet and TRL) by more than 0.06.
Surface normal vector estimation results on NYUv2 test set:
[Table presented as an image in the original publication]
On the main evaluation metric, the median error of CPNet is 21.3, which is very close to the state-of-the-art methods (such as PAPNet), differing by only 3.
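For reference, the two metrics quoted above can be computed as in the following sketch; treating RMSE for depth and the median angular error in degrees for normals is the usual NYUv2 convention, assumed here rather than stated in the text:

```python
import torch

def depth_rmse(pred, gt):
    # root mean square error between predicted and ground-truth depth maps
    return torch.sqrt(torch.mean((pred - gt) ** 2))

def normal_median_error(pred, gt):
    # median angular error in degrees between predicted and ground-truth
    # surface normal vectors of shape (N, 3), both assumed unit length
    cos = torch.clamp((pred * gt).sum(dim=-1), -1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).median()
```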
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (4)

1. A monocular depth estimation and surface normal vector estimation method based on a multitask network, characterized by comprising the following steps:
S1, collecting multi-scale information using a high-resolution network as the backbone network;
S2, outputting features at different resolutions through the high-resolution network, and upsampling each feature independently to obtain feature maps at the original resolution;
S3, concatenating the obtained feature maps into a multi-scale surface feature and generating a multi-scale fusion feature;
S4, splitting the multi-scale fusion feature into 2 branch features, and feeding them into a cross-correlation attention interaction module to obtain a cross-correlation matrix of learned correlations;
S5, passing each branch feature through consecutive convolution layers, obtaining two cross-correlation attention maps via a softmax operation, and using the portions of the attention maps that benefit interaction to obtain new fusion features;
and S6, repeating step S5 to obtain task-specific feature information, finally yielding the monocular depth estimation and surface normal vector estimation results.
2. The method of claim 1, wherein the high-resolution network outputs 4 features at different resolutions, namely F1, F2, F3 and F4.
3. The method of claim 1, wherein the multi-scale surface feature is denoted Fn.
4. The method of claim 1, wherein the cross-correlation attention map is a probability map with a weight between 0 and 1.
CN202010303011.2A 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network Active CN111539922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010303011.2A CN111539922B (en) 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010303011.2A CN111539922B (en) 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network

Publications (2)

Publication Number Publication Date
CN111539922A CN111539922A (en) 2020-08-14
CN111539922B true CN111539922B (en) 2023-03-31

Family

ID=71974956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010303011.2A Active CN111539922B (en) 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network

Country Status (1)

Country Link
CN (1) CN111539922B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819876B (en) * 2021-02-13 2024-02-27 西北工业大学 Monocular vision depth estimation method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110120049A (en) * 2019-04-15 2019-08-13 天津大学 By single image Combined estimator scene depth and semantic method
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110738697A (en) * 2019-10-10 2020-01-31 福州大学 Monocular depth estimation method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110120049A (en) * 2019-04-15 2019-08-13 天津大学 By single image Combined estimator scene depth and semantic method
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110738697A (en) * 2019-10-10 2020-01-31 福州大学 Monocular depth estimation method based on deep learning

Also Published As

Publication number Publication date
CN111539922A (en) 2020-08-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240418

Address after: 510000 No. 135 West Xingang Road, Guangdong, Guangzhou

Patentee after: SUN YAT-SEN University

Country or region after: China

Patentee after: National University of Defense Technology

Address before: 510275 No. 135 West Xingang Road, Guangdong, Guangzhou

Patentee before: SUN YAT-SEN University

Country or region before: China