CN111539922B - Monocular depth estimation and surface normal vector estimation method based on multitask network

Monocular depth estimation and surface normal vector estimation method based on multitask network

Info

Publication number
CN111539922B
CN111539922B CN202010303011.2A
Authority
CN
China
Prior art keywords
features
feature
correlation
scale
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010303011.2A
Other languages
Chinese (zh)
Other versions
CN111539922A (en)
Inventor
洪思宇
郭裕兰
符智恒
黄小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202010303011.2A
Publication of CN111539922A
Application granted
Publication of CN111539922B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular depth estimation and surface normal vector estimation method based on a multitask network, which comprises the following steps: collecting multi-scale information using a high-resolution network as the backbone network; outputting features at different resolutions through the high-resolution network and upsampling each feature independently to obtain feature maps at the original resolution; concatenating the obtained feature maps into a multi-scale surface feature and generating a multi-scale fusion feature; splitting the multi-scale fusion feature into 2 branch features and feeding them into a cross-correlation attention interaction module to obtain a cross-correlation matrix of learned correlations; passing each branch feature through consecutive 1x1 convolution layers and a softmax operation to obtain two cross-correlation attention maps, and using the portions of the attention maps that benefit interaction to obtain a new fusion feature; and repeating step S5 to obtain task-specific feature information, finally yielding the monocular depth estimation and surface normal vector estimation results.

Description

Monocular depth estimation and surface normal vector estimation method based on multitask network
Technical Field
The invention relates to the field of computer software, in particular to a monocular depth estimation and surface normal vector estimation method based on a multitask network.
Background
Scene depth information plays a crucial role in many research topics such as three-dimensional reconstruction, obstacle detection, and visual navigation. In 2018, Zhenyu Zhang et al. proposed TRL, a monocular depth estimation and semantic segmentation method based on a multitask network. Depth features and semantic features extracted from RGB images are weighted and concatenated; through this interaction, new depth and semantic features are obtained and used for the subsequent semantic segmentation and monocular depth estimation.
The TRL network performs interactive fusion of the multitask features in its decoder. In this process, the depth feature is merely concatenated with the weighted semantic feature, and likewise the semantic feature with the weighted depth feature. Such simple concatenation and fusion lacks theoretical guidance, and the feature maps obtained from it cannot fully exploit the feature information for interaction.
PAPNet is likewise a monocular depth estimation, semantic segmentation and surface normal vector estimation method based on a multitask network. Unlike the network Ldid, it does not let the features interact directly in the interaction process; instead it derives an affinity matrix from the features and performs a weighted summation over the affinity matrices of the individual tasks. Its performance is much higher than that of Ldid.
PAPNet also performs inter-task fusion of the multitask features in its decoder. Each branch outputs its task feature together with a corresponding affinity matrix. For the depth estimation task, for example, the affinity matrix of the depth feature is added pixel by pixel to the weighted affinity matrices of the semantic feature and the surface normal vector feature to obtain a new affinity matrix; the depth feature is then multiplied by this affinity matrix and fused into a new depth feature for the subsequent monocular depth estimation. The disadvantage of this method is that an affinity matrix must be obtained before any interaction can occur: the interaction is indirect rather than performed on the features themselves, so the feature information cannot be fully utilized.
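Although the description above does not give PAPNet's exact formulation, the affinity-style interaction it refers to can be illustrated with a minimal PyTorch sketch; the shapes, the weights w_sem and w_norm, and the softmax normalization below are illustrative assumptions rather than PAPNet's actual implementation:

```python
# Schematic sketch of affinity-matrix interaction (illustrative, not PAPNet's code).
import torch

def affinity(feat):
    # feat: (B, C, H, W) -> pairwise affinity between the H*W positions, (B, HW, HW)
    f = feat.flatten(2)                                   # (B, C, HW)
    return torch.softmax(f.transpose(1, 2) @ f, dim=-1)

def fuse_depth(depth_feat, sem_feat, norm_feat, w_sem=0.5, w_norm=0.5):
    b, c, h, w = depth_feat.shape
    # pixel-by-pixel weighted sum of the per-task affinity matrices ...
    a = affinity(depth_feat) + w_sem * affinity(sem_feat) + w_norm * affinity(norm_feat)
    # ... then the depth feature is multiplied by the combined affinity matrix
    return (depth_feat.flatten(2) @ a).view(b, c, h, w)
```

The indirect nature of the scheme is visible in the sketch: the task features only meet through the HW x HW affinity matrices, which is precisely the limitation the invention addresses.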
Disclosure of Invention
The invention aims to overcome the feature-interaction shortcomings of TRL and PAPNet by constructing a module that directly utilizes and screens feature information for interaction. Compared with TRL, the method uses cross-correlation as theoretical guidance for feature fusion; compared with PAPNet, it can perform feature interaction directly and quickly.
In order to achieve the above object, the invention adopts the following technical solution:
A monocular depth estimation and surface normal vector estimation method based on a multitask network comprises the following steps:
S1, collecting multi-scale information using a high-resolution network as the backbone network;
S2, outputting features at different resolutions through the high-resolution network, and upsampling each feature independently to obtain feature maps at the original resolution;
S3, concatenating the obtained feature maps into a multi-scale surface feature and generating a multi-scale fusion feature;
S4, splitting the multi-scale fusion feature into 2 branch features, and feeding them into a cross-correlation attention interaction module to obtain a cross-correlation matrix of learned correlations;
S5, passing each branch feature through consecutive 1x1 convolution layers, obtaining two cross-correlation attention maps via a softmax operation, and using the portions of the attention maps that benefit interaction to obtain new fusion features;
and S6, repeating step S5 to obtain task-specific feature information, finally yielding the monocular depth estimation and surface normal vector estimation results.
Preferably, the high-resolution network outputs 4 features at different resolutions, namely F1, F2, F3 and F4.
Preferably, the multi-scale surface feature is denoted Fn.
Preferably, the cross-correlation attention map is a probability map with weights between 0 and 1.
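As an illustration of steps S1 to S3, the following minimal PyTorch sketch upsamples the four backbone outputs F1 to F4 to a common resolution, concatenates them into the multi-scale surface feature Fn, and fuses them with a convolution; the channel counts and the 1x1 fusion convolution are assumptions for illustration, not the patented configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    # Sketch of steps S1-S3; the channel counts of F1-F4 are assumed.
    def __init__(self, in_channels=(32, 64, 128, 256), fused_channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels), fused_channels, kernel_size=1)

    def forward(self, feats):
        # feats: [F1, F2, F3, F4] from the high-resolution backbone,
        # with F1 carrying the highest (original) resolution
        h, w = feats[0].shape[-2:]
        up = [F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
              for f in feats]
        fn = torch.cat(up, dim=1)   # multi-scale surface feature Fn
        return self.fuse(fn)        # multi-scale fusion feature
```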
The beneficial effect of the invention is that it constructs a module that directly utilizes and screens feature information for interaction. Compared with TRL, the method uses cross-correlation as theoretical guidance for feature fusion; compared with PAPNet, it can perform feature interaction directly and quickly.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic flow chart of the cross-correlation attention mechanism interaction module in FIG. 1.
Detailed Description
The present invention will be further described below with reference to the accompanying drawings. It should be noted that the following examples illustrate detailed embodiments and specific operations based on the technical solutions of the present invention, but the protection scope of the present invention is not limited to these examples.
As shown in fig. 1 and fig. 2, the present invention is a method for monocular depth estimation and surface normal vector estimation based on a multitask network, the method includes the following steps:
s1, adopting a high-resolution network as multi-scale information of a backbone network set;
s2, outputting characteristics with different resolutions through a high-resolution network, and independently up-sampling the characteristics to obtain a characteristic diagram which is the same as the original resolution;
s3, connecting the obtained feature maps in series to obtain a multi-scale surface feature, and generating a multi-scale fusion feature;
s4, dividing the multi-scale fusion features into 2 branch features, inputting the branch features into a cross-correlation attention mechanism interaction module, and obtaining a cross-correlation matrix of learning correlation;
s5, inputting the 1x1 continuous convolution layer of each branch feature, obtaining two cross-correlation attention diagrams through softmax operation, and obtaining new fusion features by utilizing the parts which are beneficial to interaction on the attention diagrams;
and S6, repeating the step S5 to obtain the characteristic information of the specific task, and finally obtaining the monocular depth estimation and surface normal vector estimation results.
Preferably, the high-resolution network outputs 4 features at different resolutions, namely F1, F2, F3 and F4.
Preferably, the multi-scale surface feature is denoted Fn.
Preferably, the cross-correlation attention map is a probability map with weights between 0 and 1.
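The exact operations inside the cross-correlation attention mechanism interaction module of FIG. 2 are not spelled out here, so the following PyTorch sketch of steps S4 and S5 is modeled on standard cross-attention and should be read as an assumption: consecutive 1x1 convolutions project each branch, their product forms the cross-correlation matrix, softmax turns it into two attention maps with weights between 0 and 1, and each branch is refreshed with the portions of the other branch that its attention map selects:

```python
import torch
import torch.nn as nn

class CrossCorrelationAttention(nn.Module):
    # Sketch of the cross-correlation attention interaction module (steps S4-S5);
    # the projection depth and residual fusion are assumptions, not the patented design.
    def __init__(self, channels=128):
        super().__init__()
        # consecutive 1x1 convolution layers for each branch
        self.proj_a = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                                    nn.Conv2d(channels, channels, 1))
        self.proj_b = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                                    nn.Conv2d(channels, channels, 1))

    def forward(self, feat_a, feat_b):
        b, c, h, w = feat_a.shape
        qa = self.proj_a(feat_a).flatten(2)                    # (B, C, HW)
        qb = self.proj_b(feat_b).flatten(2)                    # (B, C, HW)
        corr = qa.transpose(1, 2) @ qb                         # cross-correlation matrix (B, HW, HW)
        attn_a = torch.softmax(corr, dim=-1)                   # attention map: branch A over branch B
        attn_b = torch.softmax(corr.transpose(1, 2), dim=-1)   # attention map: branch B over branch A
        # each branch keeps its own feature and absorbs the useful portion of the other
        new_a = feat_a + (qb @ attn_a.transpose(1, 2)).view(b, c, h, w)
        new_b = feat_b + (qa @ attn_b.transpose(1, 2)).view(b, c, h, w)
        return new_a, new_b
```

Under step S6 such a module would be applied repeatedly, with one branch ultimately supervised for depth and the other for surface normal vectors.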
Example 1
To validate the technical solution of the present invention, CPNet was evaluated on the NYUv2 indoor dataset, which contains 120,000 RGB images and depth maps. Surface normal vector maps were computed from the depth maps, and the method of the invention was evaluated by splitting the official dataset into 12,000 images for training and 654 images for validation. Furthermore, the unified evaluation criteria were used to obtain the metrics of the inventive method. CPNet was implemented in PyTorch and trained from scratch on an RTX 2080Ti.
The depth estimation results on the NYUv2 test set are given in the following table:
[Table presented as an image in the original publication]
On the primary evaluation metric, CPNet achieves a root mean square error (RMSE) of 0.431, improving on the state-of-the-art methods (e.g., PAPNet and TRL) by more than 0.06.
Surface normal vector estimation results on NYUv2 test set:
[Table presented as an image in the original publication]
On the main evaluation metric, the median error of CPNet is 21.3, which is very close to the state-of-the-art methods (such as PAPNet), differing by only 3.
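For reference, the two metrics quoted above can be computed as in the following sketch; treating RMSE for depth and the median angular error in degrees for normals is the usual NYUv2 convention, assumed here rather than stated in the text:

```python
import torch

def depth_rmse(pred, gt):
    # root mean square error between predicted and ground-truth depth maps
    return torch.sqrt(torch.mean((pred - gt) ** 2))

def normal_median_error(pred, gt):
    # median angular error in degrees between predicted and ground-truth
    # surface normal vectors of shape (N, 3), both assumed unit length
    cos = torch.clamp((pred * gt).sum(dim=-1), -1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).median()
```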
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (4)

1. A monocular depth estimation and surface normal vector estimation method based on a multitask network, characterized by comprising the following steps:
S1, collecting multi-scale information using a high-resolution network as the backbone network;
S2, outputting features at different resolutions through the high-resolution network, and upsampling each feature independently to obtain feature maps at the original resolution;
S3, concatenating the obtained feature maps into a multi-scale surface feature and generating a multi-scale fusion feature;
S4, splitting the multi-scale fusion feature into 2 branch features, and feeding them into a cross-correlation attention interaction module to obtain a cross-correlation matrix of learned correlations;
S5, passing each branch feature through consecutive convolution layers, obtaining two cross-correlation attention maps via a softmax operation, and using the portions of the attention maps that benefit interaction to obtain new fusion features;
and S6, repeating step S5 to obtain task-specific feature information, finally yielding the monocular depth estimation and surface normal vector estimation results.
2. The method of claim 1, wherein the high-resolution network outputs 4 features at different resolutions, namely F1, F2, F3 and F4.
3. The method of claim 1, wherein the multi-scale surface feature is denoted Fn.
4. The method of claim 1, wherein the cross-correlation attention map is a probability map with a weight between 0 and 1.
CN202010303011.2A 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network Active CN111539922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010303011.2A CN111539922B (en) 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010303011.2A CN111539922B (en) 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network

Publications (2)

Publication Number Publication Date
CN111539922A CN111539922A (en) 2020-08-14
CN111539922B true CN111539922B (en) 2023-03-31

Family

ID=71974956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010303011.2A Active CN111539922B (en) 2020-04-17 2020-04-17 Monocular depth estimation and surface normal vector estimation method based on multitask network

Country Status (1)

Country Link
CN (1) CN111539922B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819876B (en) * 2021-02-13 2024-02-27 西北工业大学 Monocular vision depth estimation method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110120049A (en) * 2019-04-15 2019-08-13 天津大学 By single image Combined estimator scene depth and semantic method
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110738697A (en) * 2019-10-10 2020-01-31 福州大学 Monocular depth estimation method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110120049A (en) * 2019-04-15 2019-08-13 天津大学 By single image Combined estimator scene depth and semantic method
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110738697A (en) * 2019-10-10 2020-01-31 福州大学 Monocular depth estimation method based on deep learning

Also Published As

Publication number Publication date
CN111539922A (en) 2020-08-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240418

Address after: 510000 No. 135 West Xingang Road, Guangdong, Guangzhou

Patentee after: SUN YAT-SEN University

Country or region after: China

Patentee after: National University of Defense Technology

Address before: 510275 No. 135 West Xingang Road, Guangdong, Guangzhou

Patentee before: SUN YAT-SEN University

Country or region before: China