CN112967327A - Monocular depth method based on combined self-attention mechanism - Google Patents

Monocular depth method based on combined self-attention mechanism

Info

Publication number
CN112967327A
Authority
CN
China
Prior art keywords
depth
self
attention
training
module
Prior art date
Legal status
Pending
Application number
CN202110239390.8A
Other languages
Chinese (zh)
Inventor
张玉亮
赵智龙
付炜平
孟荣
范晓丹
刘洪吉
张宁
王东辉
张东坡
李兴文
曾建生
Current Assignee
State Grid Corp of China SGCC
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Maintenance Branch of State Grid Hebei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202110239390.8A priority Critical patent/CN112967327A/en
Publication of CN112967327A publication Critical patent/CN112967327A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/50 - Image analysis; depth or shape recovery
    • G06N 3/045 - Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/047 - Neural networks; architecture, e.g. interconnection topology; probabilistic or stochastic networks
    • G06N 3/08 - Neural networks; learning methods
    • G06T 2207/10028 - Image acquisition modality; range image; depth image; 3D point clouds
    • G06T 2207/20081 - Special algorithmic details; training; learning
    • G06T 2207/20084 - Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/20132 - Special algorithmic details; image segmentation details; image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a monocular depth estimation method based on a combined self-attention mechanism, which comprises the following steps: building an estimation network composed of an encoder based on a joint attention module, a decoder based on U-net, and a connection module based on a feature pyramid; setting the parameters of each convolutional layer according to the resolution of the required depth map; building the estimation network model; training the network model with a training data set and extracting the decoder output; computing the loss between the decoder output and the corresponding depth map and updating the model parameters with the loss function; and performing depth prediction on input images with the final model after training. The invention uses a spatial self-attention mechanism and a channel self-attention mechanism and introduces a filtering mechanism in the encoder module, so that depth information is extracted from local feature maps combined with global context information. This addresses the inability of a convolutional neural network to effectively integrate local and global information and further improves the accuracy of depth estimation.

Description

Monocular depth method based on combined self-attention mechanism
Technical Field
The invention relates to the technical field of three-dimensional image depth estimation, and in particular to a monocular depth estimation method based on a combined self-attention mechanism.
Background
Image depth information is important for a computer to understand a real-world 3D scene. Depth estimation is a key component of understanding the geometric relationships in a scene and a basic task of environment perception, widely applied in 3D image reconstruction, robot navigation, pose estimation, and simultaneous localization and mapping (SLAM). The depth estimation task is to acquire the depth information of an image, i.e., the distance between the scene represented by each pixel and the observer. At present, distance information is mainly obtained with lidar and depth sensors, but these sensors are expensive and place high demands on the operating environment; for example, lidar attenuates strongly in severe conditions such as heavy rain, smoke and fog, which directly reduces its usable measurement range and accuracy. Therefore, obtaining distance information from images is the preferred solution in industry.
In recent years, with the development of computer technology, deep learning has made a series of breakthroughs in computer vision, and monocular image depth estimation with deep learning has become a popular field. Compared with sensor-based measurement, computer-vision solutions are compact, convenient, inexpensive and widely applicable. Raw images are usually acquired with monocular, binocular or even multi-view cameras; a stereo camera requires a fixed mounting position and careful calibration, which is time-consuming, and the associated computation is large and hard to reconcile with real-time requirements. Compared with lidar or stereo cameras, a monocular camera is cheap and easy to use. Because of its small size and low requirements on the operating environment, deploying a monocular camera on robots, autonomous vehicles and similar platforms is convenient and saves space. Monocular image depth estimation also has a small computational cost and low time complexity, so it can meet real-time requirements, and more and more researchers have begun to explore depth estimation with a monocular camera. In recent years, methods based on convolutional neural networks (CNNs) have greatly improved the performance of monocular image depth estimation.
Laina et al. proposed a fully convolutional network with a residual backbone for depth estimation, designed an efficient up-sampling scheme that replaces large convolutions with small ones, and introduced a loss based on the Huber function, obtaining good results. Cao et al. transformed the regression problem of depth estimation into a discrete classification problem, used a fully convolutional deep residual network for the classification, and refined the final depth estimate with a CRF. Qi et al. proposed a geometric neural network (GeoNet) that jointly predicts the depth map and the surface-normal map of a single image; with a CNN backbone and two branches, depth-to-normal and normal-to-depth, GeoNet integrates the geometric relationship between depth and surface normals, and the two branches effectively improve the prediction quality, consistency and accuracy. To address the loss of detail in monocular depth estimation networks, Hao et al. proposed extracting high-level context with dilated convolutions while preserving spatial detail in the feature maps; combining dilated convolutions with ResNet-101 yields a dense feature extraction module (DFE), and a depth map generation module (DMG), composed of an attention block (AFB) and a channel reduction block (CRB), exploits the information extracted by DFEs at different levels. Fu et al. likewise cast regression as classification in the DORN network; unlike an ordinary classification problem, the depth classes are ordered from small to large, so they proposed the concept of ordinal regression, and by adopting spacing-increasing discretization they prevent large depth values from dominating training, obtaining good results on the test sets.
Although monocular depth estimation models based on convolutional neural networks have achieved good results, some challenges and problems remain.
At present, the depth maps predicted by existing depth estimation methods still suffer from blurred edges, loss of detail and loss of edge information. Specifically, most depth estimation networks capture local information with convolutions whose receptive field is small, so long-range spatial information cannot be extracted comprehensively.
In addition, the skip connections used by the traditional U-net model pass the feature information of the encoding layers directly to the decoding layers for feature fusion. This fusion can exploit the multi-scale information in the encoder, but the direct connection often introduces redundant features, and the accompanying noise reduces the detail resolution of the final depth map, which affects the accuracy of the model.
Disclosure of Invention
The invention aims to provide a monocular depth estimation method based on a combined self-attention mechanism that enables the network model to generate more robust feature maps.
The invention adopts the following technical scheme:
a monocular depth method based on a combined self-attention mechanism comprises the following steps:
(1) acquiring a plurality of original training samples, and performing data enhancement operation on the original training samples to generate a training data set, wherein the original training samples comprise an original scene graph and an original depth graph;
(2) constructing an encoder based on a joint attention module, which takes DenseNet as the backbone network and combines a spatial self-attention module and a channel self-attention module;
(3) constructing a decoder with U-net as a backbone network;
(4) connecting the encoder and the decoder through an ASPP module to serve as a feature extraction framework;
(5) and (3) training by using the training data set generated in the step (1) and combining a loss function, and performing depth prediction on the input image by using the final model after training.
In the step (1), the data enhancement operation includes image cropping, random flipping, color dithering, and random rotation.
In the step (1), the training data set includes the cropped original training samples, the flipped training samples, the color-dithered training samples, and the rotated training samples.
In the step (2), the spatial self-attention mechanism is as follows: denote the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels. Pass the feature matrix X through four 3 × 3 convolutional layers whose weights are not shared to obtain four feature maps M_1, M_2, M_3 and M_4, whose size is determined by the downsampling rate r.
Adjust the shapes of M_j for j = 2, 3, 4: reshape M_2 into the query matrix Q, and reshape M_3 and M_4 into the key matrix K and the value matrix V.
Compute the spatial attention weight matrix: multiply Q by K^T and apply the softmax function to obtain A′ ∈ R^((H×W)×(H×W)):
A′ = softmax(QK^T)
Introduce a filter matrix B ∈ R^((H×W)×(H×W)), whose elements are
B_ij = 1 if A′_ij ≥ θ, and B_ij = 0 otherwise,
where θ is the filtering threshold.
Multiply A′ element-wise by B to obtain the filtered weight matrix α:
α = A′ · B
The output of the spatial self-attention module, S ∈ R^(C×H×W), is:
S = αV + M_1
In the step (2), the channel self-attention module is as follows: denote the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels.
Adjust the shape of X to obtain X′ ∈ R^(C×(H×W)), and compute the channel self-attention weight matrix A ∈ R^(C×C):
A = softmax(X′X′^T)
where A_ij is the value of the channel self-attention weight matrix A at position (i, j).
The output of the channel self-attention module, S ∈ R^(C×H×W), is computed as:
S = reshape(AX′) + X
In the step (3), the decoder uses four upsampling layers to gradually restore the resolution of the depth image to that of the input, and the output of each decoding layer is fused with the multi-scale feature map of the corresponding encoding layer through a skip connection.
In the step (4), the ASPP module is as follows: four feature maps of different scales are obtained with atrous convolutions of rates r = 1, 3, 6 and 12; the four feature maps are fused with a feature pyramid method; and the output feature map of the ASPP module is obtained by a 1 × 1 convolution.
In the step (5), the loss function is:
Loss = λ·L_depth + L_grad
where λ is an adjustable parameter, L_depth is the pixel-wise distance error loss, and L_grad is the depth gradient loss.
Further, the pixel-wise distance error loss L_depth is computed with the Ruber (reverse Huber) loss function:
L_depth = (1/N) Σ_{i=1}^{N} l(e_i), with e_i = y_i − y_i′,
l(e) = |e| if |e| ≤ δ, and l(e) = (e² + δ²) / (2δ) otherwise,
where N is the number of pixels, y_i and y_i′ are the predicted and real depth at pixel i, and δ is a hyper-parameter threshold that is adjusted adaptively during training and set to 20% of the maximum absolute error over all image pixels in the current batch.
Further, the depth gradient loss L_grad is calculated by the following formula:
L_grad = (1/N) Σ_{i=1}^{N} ( |∇_x y_i − ∇_x y_i′| + |∇_y y_i − ∇_y y_i′| )
where N is the total number of pixels in the depth image, ∇_x y_i and ∇_x y_i′ are the horizontal gradient at pixel i of the depth image predicted by the model and of the real depth image respectively, and ∇_y y_i and ∇_y y_i′ are the corresponding vertical gradients.
The invention has the following beneficial effects: a joint self-attention mechanism is introduced at different stages of the depth estimation network encoder to handle long-range relationships, so that the model can combine context information into the high-dimensional feature maps, and a filtering mechanism is implemented in the joint self-attention module, which enables the network model to generate more robust feature maps.
Drawings
FIG. 1 is the overall architecture of the model of the present invention.
Fig. 2 is a schematic diagram of the spatial self-attention module.
Fig. 3 is a schematic diagram of the channel self-attention module.
Fig. 4 is a schematic diagram of the structure of an ASPP module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways than those described herein, and it will be apparent to those of ordinary skill in the art that the present application is not limited to the specific embodiments disclosed below.
A method for predicting image depth, as shown in fig. 1, includes the following steps.
The method comprises the following steps: obtaining a plurality of original training samples, and performing data enhancement operation on the original training samples to generate a training data set, wherein the original training samples comprise an original scene graph and an original depth graph.
A training data set is prepared. In the embodiment of the present invention, the training and test sets come from the NYU Depth V2 data set, which contains 464 indoor scenes. The official split is used: the training set contains 249 scenes and the test set contains 215 scenes. During training, to allow larger batches, the resolution of the original pictures is reduced from 640 × 480 to 304 × 228 by downsampling and center cropping.
In this embodiment, to avoid over-fitting and improve the accuracy of the test results, the following data enhancement methods are used:
Random flipping: the original scene image and the corresponding original depth image are flipped horizontally with a probability of 50%.
Color dithering: the brightness, saturation and contrast of the original scene image are scaled by a factor K ∈ [0.6, 1.4].
Random rotation: the original scene image and the original depth image are rotated in the plane by a random angle r ∈ [−5°, +5°].
In this embodiment, the training data set includes all of the original training samples together with the randomly flipped, color-dithered and randomly rotated samples.
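For illustration only, the data enhancement described above can be sketched in PyTorch-style Python as follows. The helper name augment_pair and the use of torchvision.transforms.functional are assumptions of this sketch, not part of the patent; whether a single factor K is shared by brightness, saturation and contrast is likewise assumed.

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(rgb, depth):
    """Jointly augment an RGB image and its depth map (PIL images).

    Mirrors the augmentations described above: 50% horizontal flip,
    brightness/saturation/contrast scaling with a factor in [0.6, 1.4]
    (RGB only), and a random in-plane rotation in [-5, +5] degrees.
    """
    # Random horizontal flip, applied to both images so they stay aligned.
    if random.random() < 0.5:
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)

    # Color dithering: scale brightness, saturation and contrast of the RGB image.
    k = random.uniform(0.6, 1.4)
    rgb = TF.adjust_brightness(rgb, k)
    rgb = TF.adjust_saturation(rgb, k)
    rgb = TF.adjust_contrast(rgb, k)

    # Random in-plane rotation by the same angle for both images.
    angle = random.uniform(-5.0, 5.0)
    rgb = TF.rotate(rgb, angle)
    depth = TF.rotate(depth, angle)
    return rgb, depth
```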
As shown in fig. 1, the depth estimation network of the present invention includes an encoder, a decoder and an ASPP connection module.
Step two: construct an encoder layer that takes DenseNet as the backbone network and is based on the joint attention mechanism.
The encoder architecture is based on DenseNet. The output of each Dense Block of DenseNet passes through a spatial self-attention module and a channel self-attention module; the outputs of these two modules are summed and used as the input of the next Dense Block.
In fig. 1, S100, S101, S102 and S103 are the four Dense Blocks of the DenseNet network, S105, S106, S107 and S108 are spatial self-attention modules, S109, S110, S111 and S112 are channel self-attention modules, and S113 is the ASPP module.
In the embodiment of the present invention, the DenseNet uses the DenseNet-161 structure.
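As an illustrative sketch only, and not the patent's implementation, one encoder stage described above could be wired as follows in PyTorch. The class name JointAttentionEncoderStage is hypothetical, and the attention modules themselves are passed in so that the sketch stays independent of their exact implementation (sketched further below).

```python
import torch.nn as nn

class JointAttentionEncoderStage(nn.Module):
    """One encoder stage as described above: a DenseNet dense block whose
    output is passed through a spatial and a channel self-attention module,
    with the two attention outputs summed before the next dense block."""

    def __init__(self, dense_block, spatial_attention, channel_attention):
        super().__init__()
        self.dense_block = dense_block
        self.spatial_attention = spatial_attention
        self.channel_attention = channel_attention

    def forward(self, x):
        features = self.dense_block(x)
        # The sum of the two self-attention outputs feeds the next encoder stage.
        return self.spatial_attention(features) + self.channel_attention(features)
```

In a DenseNet-161 backbone, each of the four dense blocks (together with its transition layer) would be wrapped in such a stage.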
As shown in fig. 2, the spatial self-attention module in the embodiment includes:
recording the input characteristic diagram as X ∈ RC×H×WThe height is H, the width is W, and the number of channels is C;
the feature matrix X is passed through four 3 × 3 convolutional layers, which are represented by S200, S201, S202, S203 in fig. 2.
The output sizes of the convolution layers are all
Figure BDA0002961857530000061
Where r is the downsampling rate.
Adjusting the size of the characteristic diagram obtained in the above steps, and recording as:
Figure BDA0002961857530000071
and M ∈ RC×(H×W)
Q and K areTMatrix multiplication is carried out, and then calculation is carried out through a softmax function to obtain
A′∈R(H×W)×(H×W)
A′=softmax(QKT)。
Introducing a filter matrix B epsilon R(H×W)×(H×W)
Wherein, the elements in the matrix B are as follows:
Figure BDA0002961857530000072
multiplying B and A' to obtain a filtered weight matrix alpha
α=A′·B
Output of spatial self-attention module S ∈ RC×H×WThe following were used:
S=αV+M。
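A minimal PyTorch sketch of such a spatial self-attention block is given below. It assumes the thresholded filter matrix described above and, as a simplification, 3 × 3 convolutions that keep the channel count unchanged (the downsampling rate r is not modelled); the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Spatial self-attention with a filtering threshold, as a sketch of
    the module described above. The four 3x3 convolutions preserve C,
    which is a simplification of this sketch."""

    def __init__(self, channels, theta=0.3):
        super().__init__()
        # Four 3x3 convolutions with independent weights: M, Q, K, V.
        self.conv_m = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_q = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_k = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_v = nn.Conv2d(channels, channels, 3, padding=1)
        self.theta = theta

    def forward(self, x):
        b, c, h, w = x.shape
        m = self.conv_m(x)                                    # C x H x W
        q = self.conv_q(x).flatten(2).transpose(1, 2)         # (HW) x C
        k = self.conv_k(x).flatten(2)                         # C x (HW)
        v = self.conv_v(x).flatten(2).transpose(1, 2)         # (HW) x C

        attn = torch.softmax(q @ k, dim=-1)                   # (HW) x (HW)
        # Filtering: suppress attention weights below the threshold theta.
        attn = attn * (attn >= self.theta).float()
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)  # C x H x W
        return out + m
```

Setting theta to 0 disables the filter, which corresponds to model two in the experiments reported further below.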
As shown in fig. 3, the channel self-attention module in the embodiment operates as follows.
Denote the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels.
Adjust the shape of X to obtain X′ ∈ R^(C×(H×W)).
Compute the channel self-attention weight matrix A ∈ R^(C×C):
A = softmax(X′X′^T)
where A_ij is the value of the channel self-attention weight matrix A at position (i, j).
The output of the channel self-attention module, S ∈ R^(C×H×W), is computed as:
S = reshape(AX′) + X.
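The channel self-attention module could be sketched as follows, under the assumption stated above that the C × C weight matrix is a row-wise softmax over channel-wise similarities; the class name is illustrative.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Channel self-attention sketched after the description above: a C x C
    attention matrix computed from channel-wise similarities, applied to the
    reshaped feature map and added back to the input."""

    def forward(self, x):
        b, c, h, w = x.shape
        x_flat = x.flatten(2)                                           # C x (HW)
        # Channel-wise similarity matrix, normalized row-wise with softmax.
        attn = torch.softmax(x_flat @ x_flat.transpose(1, 2), dim=-1)   # C x C
        out = (attn @ x_flat).reshape(b, c, h, w)
        return out + x
```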
step three: and constructing a decoder taking the U-net as a backbone network.
The decoder uses four upsampling layers to gradually restore the resolution of the depth image to that of the input, and the output of each decoding layer is fused with the multi-scale feature map of the corresponding encoding layer through a skip connection.
In the embodiment of the present invention, the multi-scale feature fusion proceeds as follows:
the feature map from the previous layer is first upsampled by a deconvolution operation, and the upsampling rate is equal to the downsampling rate of the corresponding encoder layer.
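One decoder stage, as described above, could be sketched as follows. Fusing the skip feature by concatenation followed by a 3 × 3 convolution is an assumption of this sketch, since the text only states that the features are fused; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One U-net-style decoder stage: transposed-convolution upsampling
    followed by fusion with the skip-connected encoder feature map."""

    def __init__(self, in_ch, skip_ch, out_ch, scale=2):
        super().__init__()
        # Deconvolution upsampling; the rate matches the encoder downsampling.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=scale, stride=scale)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # upsample by the stage's rate
        x = torch.cat([x, skip], dim=1)     # fuse with the encoder feature map
        return self.fuse(x)
```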
Step four: the encoder and decoder are connected by the ASPP structure.
The structure of the ASPP module is shown in fig. 4. Four feature maps of different scales are obtained with atrous convolutions of rates r = 1, 3, 6 and 12, and the four feature maps are fused with a feature pyramid method. The output feature map of the ASPP module is obtained by the 1 × 1 convolution S404.
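A sketch of the ASPP connection module follows. Replacing the feature-pyramid fusion with a simple concatenation is a simplification of this sketch, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """ASPP connection module sketched from the description above: four
    parallel atrous convolutions with rates 1, 3, 6 and 12, whose outputs
    are fused and reduced by a 1x1 convolution."""

    def __init__(self, in_ch, out_ch, rates=(1, 3, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))
```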
Step five: and (4) training by using the training data set generated in the step one and combining a loss function, and performing depth prediction on the input image by using the final model after training.
The loss function involved is:
Loss = λ·L_depth + L_grad
where λ is an adjustable parameter, L_depth is the pixel-wise distance error loss, and L_grad is the depth gradient loss.
In the present embodiment, L_depth uses the Ruber (reverse Huber) loss function:
L_depth = (1/N) Σ_{i=1}^{N} l(e_i), with e_i = y_i − y_i′,
l(e) = |e| if |e| ≤ δ, and l(e) = (e² + δ²) / (2δ) otherwise,
where y_i and y_i′ are the predicted and real depth at pixel i, and δ is a hyper-parameter threshold that is adjusted adaptively during training and set to 20% of the maximum absolute error over all image pixels in the current batch.
In the present embodiment, L_grad is computed by the following formula:
L_grad = (1/N) Σ_{i=1}^{N} ( |∇_x y_i − ∇_x y_i′| + |∇_y y_i − ∇_y y_i′| )
where N is the total number of pixels in the depth image, ∇_x y_i and ∇_x y_i′ are the horizontal gradient at pixel i of the depth image predicted by the model and of the real depth image respectively, and ∇_y y_i and ∇_y y_i′ are the corresponding vertical gradients.
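The combined loss could be sketched as follows, assuming the reverse-Huber (berHu) reading of the Ruber loss and the L1 penalty on gradient differences as reconstructed above; the function name depth_loss is illustrative.

```python
import torch

def depth_loss(pred, gt, lam=1.0):
    """Combined loss sketch: lam * L_depth + L_grad.

    L_depth uses a reverse-Huber (berHu) penalty with delta set to 20% of
    the largest absolute error in the batch; L_grad penalizes differences
    of horizontal and vertical image gradients. Both readings are
    assumptions of this sketch."""
    err = pred - gt
    abs_err = err.abs()
    # 20% of the maximum absolute error in the current batch (kept positive).
    delta = (0.2 * abs_err.max()).clamp(min=1e-6).detach()
    berhu = torch.where(abs_err <= delta,
                        abs_err,
                        (err ** 2 + delta ** 2) / (2 * delta))
    l_depth = berhu.mean()

    # Horizontal and vertical gradients of predicted and ground-truth depth.
    dx_pred = pred[..., :, 1:] - pred[..., :, :-1]
    dx_gt = gt[..., :, 1:] - gt[..., :, :-1]
    dy_pred = pred[..., 1:, :] - pred[..., :-1, :]
    dy_gt = gt[..., 1:, :] - gt[..., :-1, :]
    l_grad = (dx_pred - dx_gt).abs().mean() + (dy_pred - dy_gt).abs().mean()

    return lam * l_depth + l_grad
```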
In the embodiment of the invention, an Adam optimizer is used to optimize the end-to-end network model, where the initial learning rate is 10^-4, the decay parameters β1 and β2 are set to 0.9 and 0.999 respectively, the batch size is set to 8, training lasts 30 epochs in total, and the learning rate is decayed to 10% of its value every 5 epochs.
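The optimization setup listed above could be sketched as follows; the training-loop structure, the function name, and the use of StepLR to implement the 10% decay every 5 epochs are assumptions of this sketch.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=30, batch_size=8, device="cuda"):
    """Training-loop sketch matching the hyper-parameters listed above;
    `train_set` is assumed to yield (rgb, depth) tensor pairs and
    `depth_loss` is the combined loss sketched earlier."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    # Decay the learning rate to 10% of its value every 5 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

    model.to(device).train()
    for epoch in range(epochs):
        for rgb, depth in loader:
            rgb, depth = rgb.to(device), depth.to(device)
            pred = model(rgb)
            loss = depth_loss(pred, depth)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```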
Step six: load the trained weight model, input the test set into the depth estimation network based on the joint self-attention mechanism, and infer the depth images of the test set. The inferred depth images are compared with the real depth images, the errors and accuracy are calculated, and the overall performance of the model is evaluated.
In the embodiment of the present invention, the 215 image pairs of the NYU Depth V2 test split were used in the comparative experiments, keeping the original 640 × 480 resolution of the pictures. The model was trained on a GTX 1080 Ti graphics card with 11 GB of video memory.
In the embodiment of the invention, unified error evaluation indexes are used for comparison with other algorithms. The indexes are:
1) root mean square error (RMS):
RMS = sqrt( (1/N) Σ_{i=1}^{N} (y_i − y_i′)² )
2) mean relative error (REL):
REL = (1/N) Σ_{i=1}^{N} |y_i − y_i′| / y_i′
3) average log10 error:
log10 = (1/N) Σ_{i=1}^{N} |log10(y_i) − log10(y_i′)|
4) threshold accuracy: the proportion of pixels whose relative error lies within 1.25^k, where k ∈ {1, 2, 3}:
δ_k = proportion of pixels with max(y_i / y_i′, y_i′ / y_i) < 1.25^k
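The evaluation indexes above could be computed as in the following sketch; the function name and the assumption that both depth maps contain only valid positive values are illustrative.

```python
import torch

def evaluate(pred, gt):
    """Compute the error metrics listed above for a pair of depth maps
    (valid, positive depths assumed). Returns a dict of scalar tensors."""
    rms = torch.sqrt(((pred - gt) ** 2).mean())
    rel = ((pred - gt).abs() / gt).mean()
    log10 = (torch.log10(pred) - torch.log10(gt)).abs().mean()

    # Threshold accuracy: fraction of pixels with max ratio below 1.25^k.
    ratio = torch.max(pred / gt, gt / pred)
    acc = {f"delta<1.25^{k}": (ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)}

    return {"RMS": rms, "REL": rel, "log10": log10, **acc}
```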
results of examples of the invention the following table 1 shows the results of the invention compared with some relevant research work in recent years. To understand the impact of the various modules in the model herein on the overall performance, the present invention implements three different network models. The model is a U-net model which is not combined with a joint attention mechanism, and only the U-net model is combined with an ASPP module; the second model is a network model which introduces a joint self-attention mechanism in the coding layer, but sets a filtering threshold theta to be 0, namely does not introduce the filtering mechanism; model three sets the filtering threshold θ to 0.3, and introduces the filtering mechanism into the joint self-attention module.
TABLE 1: quantitative comparison of the three models and related methods (provided as an image in the original publication).
The results show that the model with both the joint self-attention mechanism and the filtering module outperforms the model with the joint self-attention mechanism alone, and that the model with the joint self-attention mechanism outperforms the model without it, which demonstrates the effectiveness of introducing the joint self-attention mechanism in the encoding layers and of the filtering mechanism within the joint self-attention module. In the comparison with the other five models, the method of the invention achieves the lowest root mean square error. Overall, the model achieves the best performance on most indexes, which fully verifies the superiority and practical value of the proposed network model.
In an actual application scenario, the model is initialized by loading the pre-trained parameters, and the scene to be measured is captured with a monocular camera. The resolution of the image acquired by the monocular camera is adjusted according to the requirements of the application, the adjusted image is normalized so that its pixel values lie between −1 and 1, and the preprocessed image is input into the model, which infers the depth at the position of each pixel. Depending on the actual use case, the resolution is then restored by bilinear interpolation, so that the inferred depth image is resized back to the size of the original image.
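The deployment procedure described above could be sketched as follows; the network input size, the uint8 input assumption and the function name are illustrative and not taken from the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_depth(model, image, net_size=(228, 304), device="cuda"):
    """Inference sketch for a single RGB frame.

    `image` is assumed to be an HxWx3 uint8 array or tensor; it is resized
    to the network resolution, normalized to [-1, 1], passed through the
    model, and the predicted depth is resized back with bilinear
    interpolation to the original resolution."""
    model.to(device).eval()
    h, w = image.shape[:2]

    x = torch.as_tensor(image, dtype=torch.float32, device=device)
    x = x.permute(2, 0, 1).unsqueeze(0)                     # 1 x 3 x H x W
    x = F.interpolate(x, size=net_size, mode="bilinear", align_corners=False)
    x = x / 127.5 - 1.0                                     # normalize to [-1, 1]

    depth = model(x)                                        # 1 x 1 x h' x w'
    depth = F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)
    return depth.squeeze()
```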
The invention provides a monocular depth estimation method based on a combined self-attention mechanism for practical application scenarios. The method adopts an innovative network architecture that introduces a joint self-attention mechanism at different stages to handle long-range relationships, enabling the model to combine context information into the high-dimensional feature maps, and implements a filtering mechanism in the joint self-attention module that lets the network generate more robust feature maps, which greatly improves the depth estimation results. The invention is easy to deploy, has a small computational cost, can meet real-time requirements, can be conveniently deployed on related embedded devices, and can be widely applied to practical scenarios such as 3D image reconstruction, indoor modeling, pose estimation and robot navigation.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention. Although the above embodiments of the present invention have been disclosed, these specific embodiments are merely illustrative of the invention and do not limit the invention. Variations may be made by those skilled in the art without departing from the concept and scope of the invention. Therefore, the protection scope of the present invention is subject to the claims.

Claims (10)

1. A monocular depth method based on a combined self-attention mechanism is characterized by comprising the following steps:
(1) acquiring a plurality of original training samples, and performing data enhancement operation on the original training samples to generate a training data set, wherein the original training samples comprise an original scene graph and an original depth graph;
(2) constructing an encoder based on a joint attention module, which takes DenseNet as the backbone network and combines a spatial self-attention module and a channel self-attention module;
(3) constructing a decoder with U-net as a backbone network;
(4) connecting the encoder and the decoder through an ASPP module to serve as a feature extraction framework;
(5) and (3) training by using the training data set generated in the step (1) and combining a loss function, and performing depth prediction on the input image by using the final model after training.
2. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (1), the data enhancement operation includes image cropping, random flipping, color dithering and random rotation.
3. The method according to claim 2, wherein in step (1), the training data set comprises the cropped original training samples, the flipped training samples, the color-dithered training samples, and the rotated training samples.
4. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (2), the spatial self-attention mechanism is: denoting the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels; passing the feature matrix X through four 3 × 3 convolutional layers whose weights are not shared to obtain four feature maps M_1, M_2, M_3 and M_4, whose size is determined by the downsampling rate r;
adjusting the shapes of M_j for j = 2, 3, 4: reshaping M_2 into the query matrix Q, and reshaping M_3 and M_4 into the key matrix K and the value matrix V;
computing the spatial attention weight matrix: multiplying Q by K^T and applying the softmax function to obtain A′ ∈ R^((H×W)×(H×W)):
A′ = softmax(QK^T)
introducing a filter matrix B ∈ R^((H×W)×(H×W)), whose elements are B_ij = 1 if A′_ij ≥ θ and B_ij = 0 otherwise, where θ is the filtering threshold;
multiplying A′ element-wise by B to obtain the filtered weight matrix α:
α = A′ · B
the output of the spatial self-attention module, S ∈ R^(C×H×W), being:
S = αV + M_1
5. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (2), the channel self-attention module is: denoting the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels; adjusting the shape of X to obtain X′ ∈ R^(C×(H×W)) and computing the channel self-attention weight matrix A ∈ R^(C×C):
A = softmax(X′X′^T)
where A_ij is the value of the channel self-attention weight matrix A at position (i, j);
the output of the channel self-attention module, S ∈ R^(C×H×W), being computed as:
S = reshape(AX′) + X
6. the monocular depth method based on the joint self-attention mechanism as claimed in claim 1, wherein, in step (3), the decoder uses four upsampling layers to gradually restore the resolution of the depth image to be consistent with the input state; and the output of each decoding layer is subjected to feature fusion with the multi-scale feature map of the coding layer connected with the skip layer.
7. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (4), the ASPP module is: obtaining four feature maps of different scales with atrous convolutions of rates r = 1, 3, 6 and 12; fusing the four feature maps with a feature pyramid method; and obtaining the output feature map of the ASPP module by a 1 × 1 convolution.
8. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in the step (5), the loss function is:
Loss = λ·L_depth + L_grad
where λ is an adjustable parameter, L_depth is the pixel-wise distance error loss, and L_grad is the depth gradient loss.
9. The method of claim 8, wherein the pixel-wise distance error loss L_depth is computed with the Ruber (reverse Huber) loss function:
L_depth = (1/N) Σ_{i=1}^{N} l(e_i), with e_i = y_i − y_i′,
l(e) = |e| if |e| ≤ δ, and l(e) = (e² + δ²) / (2δ) otherwise,
where δ is a hyper-parameter threshold that is adjusted adaptively during training and set to 20% of the maximum absolute error over all image pixels in the current batch.
10. The method of claim 8, wherein the depth gradient loss L_grad is calculated by the following formula:
L_grad = (1/N) Σ_{i=1}^{N} ( |∇_x y_i − ∇_x y_i′| + |∇_y y_i − ∇_y y_i′| )
where N is the total number of pixels in the depth image, ∇_x y_i and ∇_x y_i′ are the horizontal gradient at pixel i of the depth image predicted by the model and of the real depth image respectively, and ∇_y y_i and ∇_y y_i′ are the corresponding vertical gradients.
CN202110239390.8A 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism Pending CN112967327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239390.8A CN112967327A (en) 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239390.8A CN112967327A (en) 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism

Publications (1)

Publication Number Publication Date
CN112967327A true CN112967327A (en) 2021-06-15

Family

ID=76276461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239390.8A Pending CN112967327A (en) 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112967327A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591955A (en) * 2021-07-20 2021-11-02 首都师范大学 Method, system, equipment and medium for extracting global information of graph data
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
CN114972882A (en) * 2022-06-17 2022-08-30 西安交通大学 Wear surface damage depth estimation method and system based on multi-attention machine system
CN116310150A (en) * 2023-05-17 2023-06-23 广东皮阿诺科学艺术家居股份有限公司 Furniture multi-view three-dimensional model reconstruction method based on multi-scale feature fusion
WO2023232086A1 (en) * 2022-05-31 2023-12-07 中兴通讯股份有限公司 Foreground and background segmentation method, electronic device and computer-readable medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110610129A (en) * 2019-08-05 2019-12-24 华中科技大学 Deep learning face recognition system and method based on self-attention mechanism
CN110992414A (en) * 2019-11-05 2020-04-10 天津大学 Indoor monocular scene depth estimation method based on convolutional neural network
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111739082A (en) * 2020-06-15 2020-10-02 大连理工大学 Stereo vision unsupervised depth estimation method based on convolutional neural network
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110610129A (en) * 2019-08-05 2019-12-24 华中科技大学 Deep learning face recognition system and method based on self-attention mechanism
CN110992414A (en) * 2019-11-05 2020-04-10 天津大学 Indoor monocular scene depth estimation method based on convolutional neural network
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111739082A (en) * 2020-06-15 2020-10-02 大连理工大学 Stereo vision unsupervised depth estimation method based on convolutional neural network
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591955A (en) * 2021-07-20 2021-11-02 首都师范大学 Method, system, equipment and medium for extracting global information of graph data
CN113591955B (en) * 2021-07-20 2023-10-13 首都师范大学 Method, system, equipment and medium for extracting global information of graph data
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
WO2023232086A1 (en) * 2022-05-31 2023-12-07 中兴通讯股份有限公司 Foreground and background segmentation method, electronic device and computer-readable medium
CN114972882A (en) * 2022-06-17 2022-08-30 西安交通大学 Wear surface damage depth estimation method and system based on multi-attention machine system
CN114972882B (en) * 2022-06-17 2024-03-01 西安交通大学 Wear surface damage depth estimation method and system based on multi-attention mechanism
CN116310150A (en) * 2023-05-17 2023-06-23 广东皮阿诺科学艺术家居股份有限公司 Furniture multi-view three-dimensional model reconstruction method based on multi-scale feature fusion
CN116310150B (en) * 2023-05-17 2023-09-01 广东皮阿诺科学艺术家居股份有限公司 Furniture multi-view three-dimensional model reconstruction method based on multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN112967327A (en) Monocular depth method based on combined self-attention mechanism
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN110728658A (en) High-resolution remote sensing image weak target detection method based on deep learning
CN111192200A (en) Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN108062769B (en) Rapid depth recovery method for three-dimensional reconstruction
CN110349087B (en) RGB-D image high-quality grid generation method based on adaptive convolution
CN108734675A (en) Image recovery method based on mixing sparse prior model
CN113129272A (en) Defect detection method and device based on denoising convolution self-encoder
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN111861884A (en) Satellite cloud image super-resolution reconstruction method based on deep learning
CN111524232A (en) Three-dimensional modeling method and device and server
CN116309122A (en) Phase fringe image speckle noise suppression method based on deep learning
CN116757955A (en) Multi-fusion comparison network based on full-dimensional dynamic convolution
CN115761178A (en) Multi-view three-dimensional reconstruction method based on implicit neural representation
CN107529647B (en) Cloud picture cloud amount calculation method based on multilayer unsupervised sparse learning network
Zhang et al. Dense haze removal based on dynamic collaborative inference learning for remote sensing images
CN113160085B (en) Water bloom shielding image data collection method based on generation countermeasure network
Manimaran et al. Focal-WNet: An Architecture Unifying Convolution and Attention for Depth Estimation
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN117422619A (en) Training method of image reconstruction model, image reconstruction method, device and equipment
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN116563101A (en) Unmanned aerial vehicle image blind super-resolution reconstruction method based on frequency domain residual error
CN116385281A (en) Remote sensing image denoising method based on real noise model and generated countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination