CN112967327A - Monocular depth method based on combined self-attention mechanism - Google Patents

Monocular depth method based on combined self-attention mechanism

Info

Publication number
CN112967327A
Authority
CN
China
Prior art keywords
depth
self
attention
training
module
Prior art date
Legal status
Pending
Application number
CN202110239390.8A
Other languages
Chinese (zh)
Inventor
张玉亮
赵智龙
付炜平
孟荣
范晓丹
刘洪吉
张宁
王东辉
张东坡
李兴文
曾建生
Current Assignee
State Grid Corp of China SGCC
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Maintenance Branch of State Grid Hebei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202110239390.8A priority Critical patent/CN112967327A/en
Publication of CN112967327A publication Critical patent/CN112967327A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/50 - Image analysis; depth or shape recovery
    • G06N 3/045 - Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/047 - Neural networks; architecture, e.g. interconnection topology; probabilistic or stochastic networks
    • G06N 3/08 - Neural networks; learning methods
    • G06T 2207/10028 - Image acquisition modality; range image; depth image; 3D point clouds
    • G06T 2207/20081 - Special algorithmic details; training; learning
    • G06T 2207/20084 - Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/20132 - Special algorithmic details; image segmentation details; image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a monocular depth estimation method based on a combined self-attention mechanism, which comprises the following steps: building an estimation network composed of an encoder based on a joint attention module, a decoder based on U-net, and a connection module based on a feature pyramid; setting the parameters of each convolutional layer according to the resolution of the required depth map; building the estimation network model; training the network model with a training data set and extracting the decoder output; computing the loss between the decoder output and the corresponding depth map and updating the model parameters with the loss function; and performing depth prediction on input images with the final model after training. The invention uses a spatial self-attention mechanism and a channel self-attention mechanism and introduces a filtering mechanism in the encoder module, so that depth information is extracted from local feature maps combined with global context information. This addresses the inability of a convolutional neural network to effectively integrate local and global information and further improves the accuracy of depth estimation.

Description

Monocular depth method based on combined self-attention mechanism
Technical Field
The invention relates to the technical field of three-dimensional image depth estimation, and in particular to a monocular depth estimation method based on a combined self-attention mechanism.
Background
Image depth information is important for a computer to understand a real-world 3D scene. Depth estimation is a key component of understanding the geometric relationships in a scene and a basic task of environment perception, widely applied in 3D image reconstruction, robot navigation, pose estimation, and simultaneous localization and mapping (SLAM). The depth estimation task is to acquire the depth information of an image, i.e., the distance between the scene represented by each pixel and the observer. At present, distance information is mainly obtained with lidar and depth sensors, but these sensors are expensive and place high demands on the operating environment; for example, lidar attenuates strongly in severe conditions such as heavy rain, smoke and fog, which directly reduces its usable measurement range and accuracy. Therefore, obtaining distance information from images is the preferred solution in industry.
In recent years, with the development of computer technology, deep learning has made a series of breakthroughs in computer vision, and monocular image depth estimation with deep learning has become a popular field. Compared with sensor-based measurement, computer-vision solutions are compact, convenient, inexpensive and widely applicable. Raw images are usually acquired with monocular, binocular or even multi-view cameras; a stereo camera requires a fixed mounting position and careful calibration, which is time-consuming, and the associated computation is large and hard to reconcile with real-time requirements. Compared with lidar or stereo cameras, a monocular camera is cheap and easy to use. Because of its small size and low requirements on the operating environment, deploying a monocular camera on robots, autonomous vehicles and similar platforms is convenient and saves space. Monocular image depth estimation also has a small computational cost and low time complexity, so it can meet real-time requirements, and more and more researchers have begun to explore depth estimation with a monocular camera. In recent years, methods based on convolutional neural networks (CNNs) have greatly improved the performance of monocular image depth estimation.
Laina et al. proposed a fully convolutional network with a residual backbone for depth estimation, designed an efficient up-sampling scheme that replaces large convolutions with small ones, and introduced a loss based on the Huber function, obtaining good results. Cao et al. transformed the regression problem of depth estimation into a discrete classification problem, used a fully convolutional deep residual network for the classification, and refined the final depth estimate with a CRF. Qi et al. proposed a geometric neural network (GeoNet) that jointly predicts the depth map and the surface-normal map of a single image; with a CNN backbone and two branches, depth-to-normal and normal-to-depth, GeoNet integrates the geometric relationship between depth and surface normals, and the two branches effectively improve the prediction quality, consistency and accuracy. To address the loss of detail in monocular depth estimation networks, Hao et al. proposed extracting high-level context with dilated convolutions while preserving spatial detail in the feature maps; combining dilated convolutions with ResNet-101 yields a dense feature extraction module (DFE), and a depth map generation module (DMG), composed of an attention block (AFB) and a channel reduction block (CRB), exploits the information extracted by DFEs at different levels. Fu et al. likewise cast regression as classification in the DORN network; unlike an ordinary classification problem, the depth classes are ordered from small to large, so they proposed the concept of ordinal regression, and by adopting spacing-increasing discretization they prevent large depth values from dominating training, obtaining good results on the test sets.
Although monocular depth estimation models based on convolutional neural networks have achieved good results, some challenges and problems remain.
At present, the depth maps predicted by existing depth estimation methods still suffer from blurred edges, loss of detail and loss of edge information. Specifically, most depth estimation networks capture local information with convolutions whose receptive field is small, so long-range spatial information cannot be extracted comprehensively.
In addition, the skip connections used by the traditional U-net model pass the feature information of the encoding layers directly to the decoding layers for feature fusion. This fusion can exploit the multi-scale information in the encoder, but the direct connection often introduces redundant features, and the accompanying noise reduces the detail resolution of the final depth map, which affects the accuracy of the model.
Disclosure of Invention
The invention aims to provide a monocular depth estimation method based on a combined self-attention mechanism that enables the network model to generate more robust feature maps.
The invention adopts the following technical scheme:
a monocular depth method based on a combined self-attention mechanism comprises the following steps:
(1) acquiring a plurality of original training samples, and performing data enhancement operation on the original training samples to generate a training data set, wherein the original training samples comprise an original scene graph and an original depth graph;
(2) constructing an encoder based on a joint attention module, which takes DenseNet as the backbone network and combines a spatial self-attention module and a channel self-attention module;
(3) constructing a decoder with U-net as a backbone network;
(4) connecting the encoder and the decoder through an ASPP module to serve as a feature extraction framework;
(5) and (3) training by using the training data set generated in the step (1) and combining a loss function, and performing depth prediction on the input image by using the final model after training.
In the step (1), the data enhancement operation includes image cropping, random flipping, color dithering, and random rotation.
In the step (1), the training data set includes the cropped original training samples, the flipped training samples, the color-dithered training samples, and the rotated training samples.
In the step (2), the spatial self-attention mechanism is as follows: denote the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels. Pass the feature matrix X through four 3 × 3 convolutional layers whose weights are not shared to obtain four feature maps M_1, M_2, M_3 and M_4, whose size is determined by the downsampling rate r.
Adjust the shapes of M_j for j = 2, 3, 4: reshape M_2 into the query matrix Q, and reshape M_3 and M_4 into the key matrix K and the value matrix V.
Compute the spatial attention weight matrix: multiply Q by K^T and apply the softmax function to obtain A′ ∈ R^((H×W)×(H×W)):
A′ = softmax(QK^T)
Introduce a filter matrix B ∈ R^((H×W)×(H×W)), whose elements are
B_ij = 1 if A′_ij ≥ θ, and B_ij = 0 otherwise,
where θ is the filtering threshold.
Multiply A′ element-wise by B to obtain the filtered weight matrix α:
α = A′ · B
The output of the spatial self-attention module, S ∈ R^(C×H×W), is:
S = αV + M_1
In the step (2), the channel self-attention module is as follows: denote the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels.
Adjust the shape of X to obtain X′ ∈ R^(C×(H×W)), and compute the channel self-attention weight matrix A ∈ R^(C×C):
A = softmax(X′X′^T)
where A_ij is the value of the channel self-attention weight matrix A at position (i, j).
The output of the channel self-attention module, S ∈ R^(C×H×W), is computed as:
S = reshape(AX′) + X
In the step (3), the decoder uses four upsampling layers to gradually restore the resolution of the depth image to that of the input, and the output of each decoding layer is fused with the multi-scale feature map of the corresponding encoding layer through a skip connection.
In the step (4), the ASPP module is as follows: four feature maps of different scales are obtained with atrous convolutions of rates r = 1, 3, 6 and 12; the four feature maps are fused with a feature pyramid method; and the output feature map of the ASPP module is obtained by a 1 × 1 convolution.
In the step (5), the loss function is:
Loss = λ·L_depth + L_grad
where λ is an adjustable parameter, L_depth is the pixel-wise distance error loss, and L_grad is the depth gradient loss.
Further, the pixel-wise distance error loss L_depth is computed with the Ruber (reverse Huber) loss function:
L_depth = (1/N) Σ_{i=1}^{N} l(e_i), with e_i = y_i − y_i′,
l(e) = |e| if |e| ≤ δ, and l(e) = (e² + δ²) / (2δ) otherwise,
where N is the number of pixels, y_i and y_i′ are the predicted and real depth at pixel i, and δ is a hyper-parameter threshold that is adjusted adaptively during training and set to 20% of the maximum absolute error over all image pixels in the current batch.
Further, the depth gradient loss L_grad is calculated by the following formula:
L_grad = (1/N) Σ_{i=1}^{N} ( |∇_x y_i − ∇_x y_i′| + |∇_y y_i − ∇_y y_i′| )
where N is the total number of pixels in the depth image, ∇_x y_i and ∇_x y_i′ are the horizontal gradient at pixel i of the depth image predicted by the model and of the real depth image respectively, and ∇_y y_i and ∇_y y_i′ are the corresponding vertical gradients.
The invention has the following beneficial effects: a joint self-attention mechanism is introduced at different stages of the depth estimation network encoder to handle long-range relationships, so that the model can combine context information into the high-dimensional feature maps, and a filtering mechanism is implemented in the joint self-attention module, which enables the network model to generate more robust feature maps.
Drawings
FIG. 1 is the overall architecture of the model of the present invention.
Fig. 2 is a schematic diagram of the spatial self-attention module.
Fig. 3 is a schematic diagram of the channel self-attention module.
Fig. 4 is a schematic diagram of the structure of an ASPP module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways than those described herein, and it will be apparent to those of ordinary skill in the art that the present application is not limited to the specific embodiments disclosed below.
A method for predicting image depth, as shown in fig. 1, includes the following steps.
The method comprises the following steps: obtaining a plurality of original training samples, and performing data enhancement operation on the original training samples to generate a training data set, wherein the original training samples comprise an original scene graph and an original depth graph.
A training data set is prepared. In the embodiment of the present invention, the training and test sets come from the NYU Depth V2 data set, which contains 464 indoor scenes. The official split is used: the training set contains 249 scenes and the test set contains 215 scenes. During training, to allow larger batches, the resolution of the original pictures is reduced from 640 × 480 to 304 × 228 by downsampling and center cropping.
In this embodiment, to avoid over-fitting and improve the accuracy of the test results, the following data enhancement methods are used:
Random flipping: the original scene image and the corresponding original depth image are flipped horizontally with a probability of 50%.
Color dithering: the brightness, saturation and contrast of the original scene image are scaled by a factor K ∈ [0.6, 1.4].
Random rotation: the original scene image and the original depth image are rotated in the plane by a random angle r ∈ [−5°, +5°].
In this embodiment, the training data set includes all of the original training samples together with the randomly flipped, color-dithered and randomly rotated samples.
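For illustration only, the data enhancement described above can be sketched in PyTorch-style Python as follows. The helper name augment_pair and the use of torchvision.transforms.functional are assumptions of this sketch, not part of the patent; whether a single factor K is shared by brightness, saturation and contrast is likewise assumed.

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(rgb, depth):
    """Jointly augment an RGB image and its depth map (PIL images).

    Mirrors the augmentations described above: 50% horizontal flip,
    brightness/saturation/contrast scaling with a factor in [0.6, 1.4]
    (RGB only), and a random in-plane rotation in [-5, +5] degrees.
    """
    # Random horizontal flip, applied to both images so they stay aligned.
    if random.random() < 0.5:
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)

    # Color dithering: scale brightness, saturation and contrast of the RGB image.
    k = random.uniform(0.6, 1.4)
    rgb = TF.adjust_brightness(rgb, k)
    rgb = TF.adjust_saturation(rgb, k)
    rgb = TF.adjust_contrast(rgb, k)

    # Random in-plane rotation by the same angle for both images.
    angle = random.uniform(-5.0, 5.0)
    rgb = TF.rotate(rgb, angle)
    depth = TF.rotate(depth, angle)
    return rgb, depth
```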
As shown in fig. 1, the depth estimation network of the present invention includes an encoder, a decoder and an ASPP connection module.
Step two: construct an encoder layer that takes DenseNet as the backbone network and is based on the joint attention mechanism.
The encoder architecture is based on DenseNet. The output of each Dense Block of DenseNet passes through a spatial self-attention module and a channel self-attention module; the outputs of these two modules are summed and used as the input of the next Dense Block.
In fig. 1, S100, S101, S102 and S103 are the four Dense Blocks of the DenseNet network, S105, S106, S107 and S108 are spatial self-attention modules, S109, S110, S111 and S112 are channel self-attention modules, and S113 is the ASPP module.
In the embodiment of the present invention, the DenseNet uses the DenseNet-161 structure.
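As an illustrative sketch only, and not the patent's implementation, one encoder stage described above could be wired as follows in PyTorch. The class name JointAttentionEncoderStage is hypothetical, and the attention modules themselves are passed in so that the sketch stays independent of their exact implementation (sketched further below).

```python
import torch.nn as nn

class JointAttentionEncoderStage(nn.Module):
    """One encoder stage as described above: a DenseNet dense block whose
    output is passed through a spatial and a channel self-attention module,
    with the two attention outputs summed before the next dense block."""

    def __init__(self, dense_block, spatial_attention, channel_attention):
        super().__init__()
        self.dense_block = dense_block
        self.spatial_attention = spatial_attention
        self.channel_attention = channel_attention

    def forward(self, x):
        features = self.dense_block(x)
        # The sum of the two self-attention outputs feeds the next encoder stage.
        return self.spatial_attention(features) + self.channel_attention(features)
```

In a DenseNet-161 backbone, each of the four dense blocks (together with its transition layer) would be wrapped in such a stage.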
As shown in fig. 2, the spatial self-attention module in the embodiment includes:
recording the input characteristic diagram as X ∈ RC×H×WThe height is H, the width is W, and the number of channels is C;
the feature matrix X is passed through four 3 × 3 convolutional layers, which are represented by S200, S201, S202, S203 in fig. 2.
The output sizes of the convolution layers are all
Figure BDA0002961857530000061
Where r is the downsampling rate.
Adjusting the size of the characteristic diagram obtained in the above steps, and recording as:
Figure BDA0002961857530000071
and M ∈ RC×(H×W)
Q and K areTMatrix multiplication is carried out, and then calculation is carried out through a softmax function to obtain
A′∈R(H×W)×(H×W)
A′=softmax(QKT)。
Introducing a filter matrix B epsilon R(H×W)×(H×W)
Wherein, the elements in the matrix B are as follows:
Figure BDA0002961857530000072
multiplying B and A' to obtain a filtered weight matrix alpha
α=A′·B
Output of spatial self-attention module S ∈ RC×H×WThe following were used:
S=αV+M。
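A minimal PyTorch sketch of such a spatial self-attention block is given below. It assumes the thresholded filter matrix described above and, as a simplification, 3 × 3 convolutions that keep the channel count unchanged (the downsampling rate r is not modelled); the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Spatial self-attention with a filtering threshold, as a sketch of
    the module described above. The four 3x3 convolutions preserve C,
    which is a simplification of this sketch."""

    def __init__(self, channels, theta=0.3):
        super().__init__()
        # Four 3x3 convolutions with independent weights: M, Q, K, V.
        self.conv_m = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_q = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_k = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_v = nn.Conv2d(channels, channels, 3, padding=1)
        self.theta = theta

    def forward(self, x):
        b, c, h, w = x.shape
        m = self.conv_m(x)                                    # C x H x W
        q = self.conv_q(x).flatten(2).transpose(1, 2)         # (HW) x C
        k = self.conv_k(x).flatten(2)                         # C x (HW)
        v = self.conv_v(x).flatten(2).transpose(1, 2)         # (HW) x C

        attn = torch.softmax(q @ k, dim=-1)                   # (HW) x (HW)
        # Filtering: suppress attention weights below the threshold theta.
        attn = attn * (attn >= self.theta).float()
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)  # C x H x W
        return out + m
```

Setting theta to 0 disables the filter, which corresponds to model two in the experiments reported further below.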
As shown in fig. 3, the channel self-attention module in the embodiment operates as follows.
Denote the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels.
Adjust the shape of X to obtain X′ ∈ R^(C×(H×W)).
Compute the channel self-attention weight matrix A ∈ R^(C×C):
A = softmax(X′X′^T)
where A_ij is the value of the channel self-attention weight matrix A at position (i, j).
The output of the channel self-attention module, S ∈ R^(C×H×W), is computed as:
S = reshape(AX′) + X.
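The channel self-attention module could be sketched as follows, under the assumption stated above that the C × C weight matrix is a row-wise softmax over channel-wise similarities; the class name is illustrative.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Channel self-attention sketched after the description above: a C x C
    attention matrix computed from channel-wise similarities, applied to the
    reshaped feature map and added back to the input."""

    def forward(self, x):
        b, c, h, w = x.shape
        x_flat = x.flatten(2)                                           # C x (HW)
        # Channel-wise similarity matrix, normalized row-wise with softmax.
        attn = torch.softmax(x_flat @ x_flat.transpose(1, 2), dim=-1)   # C x C
        out = (attn @ x_flat).reshape(b, c, h, w)
        return out + x
```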
step three: and constructing a decoder taking the U-net as a backbone network.
The decoder uses four upsampling layers to gradually restore the resolution of the depth image to that of the input, and the output of each decoding layer is fused with the multi-scale feature map of the corresponding encoding layer through a skip connection.
In the embodiment of the present invention, the multi-scale feature fusion proceeds as follows:
the feature map from the previous layer is first upsampled by a deconvolution operation, and the upsampling rate is equal to the downsampling rate of the corresponding encoder layer.
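One decoder stage, as described above, could be sketched as follows. Fusing the skip feature by concatenation followed by a 3 × 3 convolution is an assumption of this sketch, since the text only states that the features are fused; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One U-net-style decoder stage: transposed-convolution upsampling
    followed by fusion with the skip-connected encoder feature map."""

    def __init__(self, in_ch, skip_ch, out_ch, scale=2):
        super().__init__()
        # Deconvolution upsampling; the rate matches the encoder downsampling.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=scale, stride=scale)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # upsample by the stage's rate
        x = torch.cat([x, skip], dim=1)     # fuse with the encoder feature map
        return self.fuse(x)
```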
Step four: the encoder and decoder are connected by the ASPP structure.
The structure of the ASPP module is shown in fig. 4. Four feature maps of different scales are obtained with atrous convolutions of rates r = 1, 3, 6 and 12, and the four feature maps are fused with a feature pyramid method. The output feature map of the ASPP module is obtained by the 1 × 1 convolution S404.
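A sketch of the ASPP connection module follows. Replacing the feature-pyramid fusion with a simple concatenation is a simplification of this sketch, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """ASPP connection module sketched from the description above: four
    parallel atrous convolutions with rates 1, 3, 6 and 12, whose outputs
    are fused and reduced by a 1x1 convolution."""

    def __init__(self, in_ch, out_ch, rates=(1, 3, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))
```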
Step five: and (4) training by using the training data set generated in the step one and combining a loss function, and performing depth prediction on the input image by using the final model after training.
The loss function involved is:
Loss = λ·L_depth + L_grad
where λ is an adjustable parameter, L_depth is the pixel-wise distance error loss, and L_grad is the depth gradient loss.
In the present embodiment, L_depth uses the Ruber (reverse Huber) loss function:
L_depth = (1/N) Σ_{i=1}^{N} l(e_i), with e_i = y_i − y_i′,
l(e) = |e| if |e| ≤ δ, and l(e) = (e² + δ²) / (2δ) otherwise,
where y_i and y_i′ are the predicted and real depth at pixel i, and δ is a hyper-parameter threshold that is adjusted adaptively during training and set to 20% of the maximum absolute error over all image pixels in the current batch.
In the present embodiment, L_grad is computed by the following formula:
L_grad = (1/N) Σ_{i=1}^{N} ( |∇_x y_i − ∇_x y_i′| + |∇_y y_i − ∇_y y_i′| )
where N is the total number of pixels in the depth image, ∇_x y_i and ∇_x y_i′ are the horizontal gradient at pixel i of the depth image predicted by the model and of the real depth image respectively, and ∇_y y_i and ∇_y y_i′ are the corresponding vertical gradients.
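The combined loss could be sketched as follows, assuming the reverse-Huber (berHu) reading of the Ruber loss and the L1 penalty on gradient differences as reconstructed above; the function name depth_loss is illustrative.

```python
import torch

def depth_loss(pred, gt, lam=1.0):
    """Combined loss sketch: lam * L_depth + L_grad.

    L_depth uses a reverse-Huber (berHu) penalty with delta set to 20% of
    the largest absolute error in the batch; L_grad penalizes differences
    of horizontal and vertical image gradients. Both readings are
    assumptions of this sketch."""
    err = pred - gt
    abs_err = err.abs()
    # 20% of the maximum absolute error in the current batch (kept positive).
    delta = (0.2 * abs_err.max()).clamp(min=1e-6).detach()
    berhu = torch.where(abs_err <= delta,
                        abs_err,
                        (err ** 2 + delta ** 2) / (2 * delta))
    l_depth = berhu.mean()

    # Horizontal and vertical gradients of predicted and ground-truth depth.
    dx_pred = pred[..., :, 1:] - pred[..., :, :-1]
    dx_gt = gt[..., :, 1:] - gt[..., :, :-1]
    dy_pred = pred[..., 1:, :] - pred[..., :-1, :]
    dy_gt = gt[..., 1:, :] - gt[..., :-1, :]
    l_grad = (dx_pred - dx_gt).abs().mean() + (dy_pred - dy_gt).abs().mean()

    return lam * l_depth + l_grad
```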
In the embodiment of the invention, an Adam optimizer is used to optimize the end-to-end network model, where the initial learning rate is 10^-4, the decay parameters β1 and β2 are set to 0.9 and 0.999 respectively, the batch size is set to 8, training lasts 30 epochs in total, and the learning rate is decayed to 10% of its value every 5 epochs.
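The optimization setup listed above could be sketched as follows; the training-loop structure, the function name, and the use of StepLR to implement the 10% decay every 5 epochs are assumptions of this sketch.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=30, batch_size=8, device="cuda"):
    """Training-loop sketch matching the hyper-parameters listed above;
    `train_set` is assumed to yield (rgb, depth) tensor pairs and
    `depth_loss` is the combined loss sketched earlier."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    # Decay the learning rate to 10% of its value every 5 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

    model.to(device).train()
    for epoch in range(epochs):
        for rgb, depth in loader:
            rgb, depth = rgb.to(device), depth.to(device)
            pred = model(rgb)
            loss = depth_loss(pred, depth)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```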
Step six: load the trained weight model, input the test set into the depth estimation network based on the joint self-attention mechanism, and infer the depth images of the test set. The inferred depth images are compared with the real depth images, the errors and accuracy are calculated, and the overall performance of the model is evaluated.
In the embodiment of the present invention, the 215 image pairs of the NYU Depth V2 test split were used in the comparative experiments, keeping the original 640 × 480 resolution of the pictures. The model was trained on a GTX 1080 Ti graphics card with 11 GB of video memory.
In the embodiment of the invention, unified error evaluation indexes are used for comparison with other algorithms. The indexes are:
1) root mean square error (RMS):
RMS = sqrt( (1/N) Σ_{i=1}^{N} (y_i − y_i′)² )
2) mean relative error (REL):
REL = (1/N) Σ_{i=1}^{N} |y_i − y_i′| / y_i′
3) average log10 error:
log10 = (1/N) Σ_{i=1}^{N} |log10(y_i) − log10(y_i′)|
4) threshold accuracy: the proportion of pixels whose relative error lies within 1.25^k, where k ∈ {1, 2, 3}:
δ_k = proportion of pixels with max(y_i / y_i′, y_i′ / y_i) < 1.25^k
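The evaluation indexes above could be computed as in the following sketch; the function name and the assumption that both depth maps contain only valid positive values are illustrative.

```python
import torch

def evaluate(pred, gt):
    """Compute the error metrics listed above for a pair of depth maps
    (valid, positive depths assumed). Returns a dict of scalar tensors."""
    rms = torch.sqrt(((pred - gt) ** 2).mean())
    rel = ((pred - gt).abs() / gt).mean()
    log10 = (torch.log10(pred) - torch.log10(gt)).abs().mean()

    # Threshold accuracy: fraction of pixels with max ratio below 1.25^k.
    ratio = torch.max(pred / gt, gt / pred)
    acc = {f"delta<1.25^{k}": (ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)}

    return {"RMS": rms, "REL": rel, "log10": log10, **acc}
```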
results of examples of the invention the following table 1 shows the results of the invention compared with some relevant research work in recent years. To understand the impact of the various modules in the model herein on the overall performance, the present invention implements three different network models. The model is a U-net model which is not combined with a joint attention mechanism, and only the U-net model is combined with an ASPP module; the second model is a network model which introduces a joint self-attention mechanism in the coding layer, but sets a filtering threshold theta to be 0, namely does not introduce the filtering mechanism; model three sets the filtering threshold θ to 0.3, and introduces the filtering mechanism into the joint self-attention module.
TABLE 1: quantitative comparison of the three models and related methods (provided as an image in the original publication).
The results show that the model with both the joint self-attention mechanism and the filtering module outperforms the model with the joint self-attention mechanism alone, and that the model with the joint self-attention mechanism outperforms the model without it, which demonstrates the effectiveness of introducing the joint self-attention mechanism in the encoding layers and of the filtering mechanism within the joint self-attention module. In the comparison with the other five models, the method of the invention achieves the lowest root mean square error. Overall, the model achieves the best performance on most indexes, which fully verifies the superiority and practical value of the proposed network model.
In an actual application scenario, the model is initialized by loading the pre-trained parameters, and the scene to be measured is captured with a monocular camera. The resolution of the image acquired by the monocular camera is adjusted according to the requirements of the application, the adjusted image is normalized so that its pixel values lie between −1 and 1, and the preprocessed image is input into the model, which infers the depth at the position of each pixel. Depending on the actual use case, the resolution is then restored by bilinear interpolation, so that the inferred depth image is resized back to the size of the original image.
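The deployment procedure described above could be sketched as follows; the network input size, the uint8 input assumption and the function name are illustrative and not taken from the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_depth(model, image, net_size=(228, 304), device="cuda"):
    """Inference sketch for a single RGB frame.

    `image` is assumed to be an HxWx3 uint8 array or tensor; it is resized
    to the network resolution, normalized to [-1, 1], passed through the
    model, and the predicted depth is resized back with bilinear
    interpolation to the original resolution."""
    model.to(device).eval()
    h, w = image.shape[:2]

    x = torch.as_tensor(image, dtype=torch.float32, device=device)
    x = x.permute(2, 0, 1).unsqueeze(0)                     # 1 x 3 x H x W
    x = F.interpolate(x, size=net_size, mode="bilinear", align_corners=False)
    x = x / 127.5 - 1.0                                     # normalize to [-1, 1]

    depth = model(x)                                        # 1 x 1 x h' x w'
    depth = F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)
    return depth.squeeze()
```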
The invention provides a monocular depth estimation method based on a combined self-attention mechanism for practical application scenarios. The method adopts an innovative network architecture that introduces a joint self-attention mechanism at different stages to handle long-range relationships, enabling the model to combine context information into the high-dimensional feature maps, and implements a filtering mechanism in the joint self-attention module that lets the network generate more robust feature maps, which greatly improves the depth estimation results. The invention is easy to deploy, has a small computational cost, can meet real-time requirements, can be conveniently deployed on related embedded devices, and can be widely applied to practical scenarios such as 3D image reconstruction, indoor modeling, pose estimation and robot navigation.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention. Although the above embodiments of the present invention have been disclosed, these specific embodiments are merely illustrative of the invention and do not limit the invention. Variations may be made by those skilled in the art without departing from the concept and scope of the invention. Therefore, the protection scope of the present invention is subject to the claims.

Claims (10)

1. A monocular depth method based on a combined self-attention mechanism is characterized by comprising the following steps:
(1) acquiring a plurality of original training samples, and performing data enhancement operation on the original training samples to generate a training data set, wherein the original training samples comprise an original scene graph and an original depth graph;
(2) constructing an encoder based on a joint attention module, which takes DenseNet as the backbone network and combines a spatial self-attention module and a channel self-attention module;
(3) constructing a decoder with U-net as a backbone network;
(4) connecting the encoder and the decoder through an ASPP module to serve as a feature extraction framework;
(5) and (3) training by using the training data set generated in the step (1) and combining a loss function, and performing depth prediction on the input image by using the final model after training.
2. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (1), the data enhancement operation includes image cropping, random flipping, color dithering and random rotation.
3. The method according to claim 2, wherein in step (1), the training data set comprises the cropped original training samples, the flipped training samples, the color-dithered training samples, and the rotated training samples.
4. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (2), the spatial self-attention mechanism is: denoting the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels; passing the feature matrix X through four 3 × 3 convolutional layers whose weights are not shared to obtain four feature maps M_1, M_2, M_3 and M_4, whose size is determined by the downsampling rate r;
adjusting the shapes of M_j for j = 2, 3, 4: reshaping M_2 into the query matrix Q, and reshaping M_3 and M_4 into the key matrix K and the value matrix V;
computing the spatial attention weight matrix: multiplying Q by K^T and applying the softmax function to obtain A′ ∈ R^((H×W)×(H×W)):
A′ = softmax(QK^T)
introducing a filter matrix B ∈ R^((H×W)×(H×W)), whose elements are B_ij = 1 if A′_ij ≥ θ and B_ij = 0 otherwise, where θ is the filtering threshold;
multiplying A′ element-wise by B to obtain the filtered weight matrix α:
α = A′ · B
the output of the spatial self-attention module, S ∈ R^(C×H×W), being:
S = αV + M_1
5. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (2), the channel self-attention module is: denoting the input feature map as X ∈ R^(C×H×W), with height H, width W and C channels; adjusting the shape of X to obtain X′ ∈ R^(C×(H×W)) and computing the channel self-attention weight matrix A ∈ R^(C×C):
A = softmax(X′X′^T)
where A_ij is the value of the channel self-attention weight matrix A at position (i, j);
the output of the channel self-attention module, S ∈ R^(C×H×W), being computed as:
S = reshape(AX′) + X
6. the monocular depth method based on the joint self-attention mechanism as claimed in claim 1, wherein, in step (3), the decoder uses four upsampling layers to gradually restore the resolution of the depth image to be consistent with the input state; and the output of each decoding layer is subjected to feature fusion with the multi-scale feature map of the coding layer connected with the skip layer.
7. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in step (4), the ASPP module is: obtaining four feature maps of different scales with atrous convolutions of rates r = 1, 3, 6 and 12; fusing the four feature maps with a feature pyramid method; and obtaining the output feature map of the ASPP module by a 1 × 1 convolution.
8. The monocular depth method based on the combined self-attention mechanism as claimed in claim 1, wherein in the step (5), the loss function is:
Loss = λ·L_depth + L_grad
where λ is an adjustable parameter, L_depth is the pixel-wise distance error loss, and L_grad is the depth gradient loss.
9. The method of claim 8, wherein the pixel-wise distance error loss L_depth is computed with the Ruber (reverse Huber) loss function:
L_depth = (1/N) Σ_{i=1}^{N} l(e_i), with e_i = y_i − y_i′,
l(e) = |e| if |e| ≤ δ, and l(e) = (e² + δ²) / (2δ) otherwise,
where δ is a hyper-parameter threshold that is adjusted adaptively during training and set to 20% of the maximum absolute error over all image pixels in the current batch.
10. The method of claim 8, wherein the depth gradient loss L_grad is calculated by the following formula:
L_grad = (1/N) Σ_{i=1}^{N} ( |∇_x y_i − ∇_x y_i′| + |∇_y y_i − ∇_y y_i′| )
where N is the total number of pixels in the depth image, ∇_x y_i and ∇_x y_i′ are the horizontal gradient at pixel i of the depth image predicted by the model and of the real depth image respectively, and ∇_y y_i and ∇_y y_i′ are the corresponding vertical gradients.
CN202110239390.8A 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism Pending CN112967327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239390.8A CN112967327A (en) 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239390.8A CN112967327A (en) 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism

Publications (1)

Publication Number Publication Date
CN112967327A true CN112967327A (en) 2021-06-15

Family

ID=76276461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239390.8A Pending CN112967327A (en) 2021-03-04 2021-03-04 Monocular depth method based on combined self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112967327A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591955A (en) * 2021-07-20 2021-11-02 首都师范大学 Method, system, equipment and medium for extracting global information of graph data
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
CN114972882A (en) * 2022-06-17 2022-08-30 西安交通大学 Wear surface damage depth estimation method and system based on multi-attention machine system
CN116310150A (en) * 2023-05-17 2023-06-23 广东皮阿诺科学艺术家居股份有限公司 Furniture multi-view three-dimensional model reconstruction method based on multi-scale feature fusion
WO2023232086A1 (en) * 2022-05-31 2023-12-07 中兴通讯股份有限公司 Foreground and background segmentation method, electronic device and computer-readable medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110610129A (en) * 2019-08-05 2019-12-24 华中科技大学 Deep learning face recognition system and method based on self-attention mechanism
CN110992414A (en) * 2019-11-05 2020-04-10 天津大学 Indoor monocular scene depth estimation method based on convolutional neural network
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111739082A (en) * 2020-06-15 2020-10-02 大连理工大学 Stereo vision unsupervised depth estimation method based on convolutional neural network
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN110060286A (en) * 2019-04-25 2019-07-26 东北大学 A kind of monocular depth estimation method
CN110610129A (en) * 2019-08-05 2019-12-24 华中科技大学 Deep learning face recognition system and method based on self-attention mechanism
CN110992414A (en) * 2019-11-05 2020-04-10 天津大学 Indoor monocular scene depth estimation method based on convolutional neural network
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111739082A (en) * 2020-06-15 2020-10-02 大连理工大学 Stereo vision unsupervised depth estimation method based on convolutional neural network
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591955A (en) * 2021-07-20 2021-11-02 首都师范大学 Method, system, equipment and medium for extracting global information of graph data
CN113591955B (en) * 2021-07-20 2023-10-13 首都师范大学 Method, system, equipment and medium for extracting global information of graph data
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
WO2023232086A1 (en) * 2022-05-31 2023-12-07 中兴通讯股份有限公司 Foreground and background segmentation method, electronic device and computer-readable medium
CN114972882A (en) * 2022-06-17 2022-08-30 西安交通大学 Wear surface damage depth estimation method and system based on multi-attention machine system
CN114972882B (en) * 2022-06-17 2024-03-01 西安交通大学 Wear surface damage depth estimation method and system based on multi-attention mechanism
CN116310150A (en) * 2023-05-17 2023-06-23 广东皮阿诺科学艺术家居股份有限公司 Furniture multi-view three-dimensional model reconstruction method based on multi-scale feature fusion
CN116310150B (en) * 2023-05-17 2023-09-01 广东皮阿诺科学艺术家居股份有限公司 Furniture multi-view three-dimensional model reconstruction method based on multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN112967327A (en) Monocular depth method based on combined self-attention mechanism
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN111915530B (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN110728658A (en) High-resolution remote sensing image weak target detection method based on deep learning
CN111192200A (en) Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN108062769B (en) Rapid depth recovery method for three-dimensional reconstruction
CN110349087B (en) RGB-D image high-quality grid generation method based on adaptive convolution
CN108734675A (en) Image recovery method based on mixing sparse prior model
CN113129272A (en) Defect detection method and device based on denoising convolution self-encoder
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN111861884A (en) Satellite cloud image super-resolution reconstruction method based on deep learning
CN111524232A (en) Three-dimensional modeling method and device and server
CN116309122A (en) Phase fringe image speckle noise suppression method based on deep learning
CN116757955A (en) Multi-fusion comparison network based on full-dimensional dynamic convolution
CN115761178A (en) Multi-view three-dimensional reconstruction method based on implicit neural representation
CN107529647B (en) Cloud picture cloud amount calculation method based on multilayer unsupervised sparse learning network
Zhang et al. Dense haze removal based on dynamic collaborative inference learning for remote sensing images
CN113160085B (en) Water bloom shielding image data collection method based on generation countermeasure network
Manimaran et al. Focal-WNet: An Architecture Unifying Convolution and Attention for Depth Estimation
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN117422619A (en) Training method of image reconstruction model, image reconstruction method, device and equipment
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN116563101A (en) Unmanned aerial vehicle image blind super-resolution reconstruction method based on frequency domain residual error
CN116385281A (en) Remote sensing image denoising method based on real noise model and generated countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination