CN116258756A - Self-supervision monocular depth estimation method and system - Google Patents

Self-supervision monocular depth estimation method and system

Info

Publication number
CN116258756A
CN116258756A (Application CN202310176306.1A)
Authority
CN
China
Prior art keywords
convolution
image
depth
self
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310176306.1A
Other languages
Chinese (zh)
Other versions
CN116258756B (en)
Inventor
张明亮
周大正
李彬
智昱旻
刘丽霞
张友梅
张瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202310176306.1A priority Critical patent/CN116258756B/en
Publication of CN116258756A publication Critical patent/CN116258756A/en
Application granted granted Critical
Publication of CN116258756B publication Critical patent/CN116258756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20021 - Dividing image into blocks, subimages or windows
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The invention discloses a self-supervision monocular depth estimation method and system. The method comprises: obtaining an image to be estimated, and inputting the preprocessed image to be estimated into a self-supervision depth estimation network for depth estimation. The self-supervision depth estimation network comprises a Transformer branch and a convolution branch; the Transformer branch adopts an encoder-decoder structure with skip connections and is used for capturing global context information of the image; the convolution branch consists of a convolution coding layer and a rectangular convolution module with a pyramid structure and is used for extracting local context information of the image. The output features of the convolution branch are spliced with the output features of the last decoding layer, and a depth image is then output through the last decoding layer. The depth image is input into a shape refinement module, which learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with the pixel depths pixel by pixel, and outputs the final depth image, realizing more accurate depth estimation.

Description

Self-supervision monocular depth estimation method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a self-supervision monocular depth estimation method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Monocular depth estimation assigns a depth value to each pixel of a single input image and is widely used in computer vision and augmented reality applications such as automatic driving and 3D reconstruction. Currently, monocular depth estimation based on deep learning is generally divided into two research directions: supervised monocular depth estimation and self-supervised monocular depth estimation. While existing supervised methods can achieve good performance in monocular depth estimation, they typically require large amounts of diverse ground-truth depth labels for training, which are often expensive to acquire and have drawbacks; in outdoor scenarios in particular, the raw depth labels acquired, for example, with LiDAR (Light Detection and Ranging) are typically sparse points that do not align well with the original image, which limits their practical use to some extent. In contrast, self-supervised learning approaches can estimate the depth map through image re-projection across different views, without relying on any ground-truth depth labels during training. Therefore, self-supervised depth estimation has a wider field of application and a relatively lower learning cost.
At present, self-supervised depth estimation methods mainly follow two ideas: performing depth estimation of the target image with a CNN (Convolutional Neural Network) framework, or with a Transformer framework. The methods or algorithms proposed in prior studies generally either cannot model global correlations within a limited receptive field, or lack spatial perception bias when modeling local information, so existing self-supervised depth estimation performs poorly in visual tasks. Specifically:
(1) CNN-based methods can extract local context information well, but, owing to their small receptive fields and large local inductive bias, are usually insufficient to extract semantically rich global context information. Although the performance of CNN-based self-supervised depth estimation has gradually improved, the fundamental dilemma remains: global correlations cannot be modeled;
(2) Transformer-based methods generally extract global context information well for context modeling, but their potential bottleneck is the lack of representation of detail and spatial locality, because Transformer-based methods are characterized by interactions between tokens, and local features are often ignored during these interactions. Meanwhile, since depth values generally follow a long-tailed distribution, there are many short-distance objects with small depth values in natural scenes, and Transformer-based methods cannot estimate such short-distance objects well.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a self-supervision monocular depth estimation method and system. The method combines the advantage of the CNN framework in retaining local context information with the advantage of the Transformer framework in extracting global context information, extracts the complete context information of an image scene in a self-supervised manner, and realizes self-supervised depth estimation with better effect. At the same time, a shape refinement module is constructed to estimate object boundaries more accurately, and a rectangular convolution module with a pyramid structure is constructed to perceive semi-global feature information along the horizontal and vertical directions.
In a first aspect, the present disclosure provides a method of self-supervising monocular depth estimation.
A self-supervising monocular depth estimation method, comprising:
acquiring an image to be estimated, and preprocessing the image to be estimated;
inputting the preprocessed image to be estimated into a self-supervision depth estimation network, performing depth estimation, and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
In a further technical scheme, in the Transformer branch, each coding layer comprises a plurality of Transformer blocks, and each Transformer block comprises a first normalization layer, a multi-head self-attention module, a second normalization layer and a multi-layer perceptron module which are connected in sequence.
According to a further technical scheme, the rectangular convolution module has a pyramid structure and comprises a 5×5 convolution, depth separable convolutions and a 1×1 convolution, each convolution adopting the form of a strip convolution;
in the convolution branch, the local features output by the convolution coding layer are input into the rectangular convolution module; local feature information is aggregated through the 5×5 convolution, global context information is extracted through the depth separable convolutions comprising different convolution channels, the information extracted by each convolution channel and the aggregated local feature information are aggregated through the 1×1 convolution, and the final aggregated output is taken as attention weights and weighted with the input local features to obtain the final output.
According to a further technical scheme, the preprocessing comprises the following step:
the input image is divided into a plurality of image blocks of uniform size.
According to a further technical scheme, the self-supervision depth estimation network further comprises a shape refinement module, wherein the shape refinement module comprises a depth separable convolution, a convolution layer and a multi-layer perceptron module which are connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
In a second aspect, the present disclosure provides a self-supervising monocular depth estimation system.
A self-supervising monocular depth estimation system, comprising:
the image to be estimated acquisition module is used for acquiring an image to be estimated and preprocessing the image to be estimated;
the depth estimation module is used for inputting the preprocessed image to be estimated into the self-supervision depth estimation network, carrying out depth estimation and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
According to a further technical scheme, the self-supervision depth estimation network further comprises a shape refinement module, wherein the shape refinement module comprises a depth separable convolution, a convolution layer and a multi-layer perceptron module which are connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
The one or more of the above technical solutions have the following beneficial effects:
1. The invention provides a self-supervision monocular depth estimation method and system, which combine the advantage of the CNN framework in retaining local context information with the advantage of the Transformer framework in extracting global context information, extract the complete context information of an image scene in a self-supervised manner, and realize self-supervised depth estimation with better effect, while avoiding the defects that the CNN framework cannot model global correlations within a limited receptive field and that the Transformer framework generally lacks spatial perception bias during modeling.
2. Considering that strip-shaped objects in an image scene have strong correlations, the invention constructs a pyramid-structured rectangular convolution module, uses several strip convolutions of different scales to extract semi-global information in the scene, and obtains more complete context information by perceiving semi-global feature information along the horizontal and vertical directions.
3. Aiming at the problem of refining object edges in the scene, the invention obtains an accurate scene geometry by constructing a shape refinement module and learning the affinity matrix between adjacent pixels, enhancing the estimation accuracy of object edges and details without affecting model complexity, and further improving the accuracy of model estimation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a self-supervised monocular depth estimation method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a self-supervised monocular depth estimation method according to an embodiment of the present invention;
FIG. 3 is an algorithm flow chart of a method for self-supervising monocular depth estimation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a Transformer block in a self-supervised depth estimation network according to an embodiment of the present invention;
FIG. 5 is a block diagram of a rectangular convolution module in a self-supervised depth estimation network according to an embodiment of the present invention;
FIG. 6 is a block diagram of a shape refinement module in a self-supervising depth estimation network according to an embodiment of the present invention;
fig. 7 is an algorithm schematic diagram of a shape refinement module in a self-supervised depth estimation network according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In view of the problems of existing self-supervised monocular depth estimation described in the background art, this embodiment provides a self-supervision monocular depth estimation method, which improves the accuracy of depth estimation, further improves the accuracy of algorithm estimation, and achieves a better depth estimation effect. As shown in fig. 1 and 2, the method comprises the following steps:
acquiring an image to be estimated, and preprocessing the image to be estimated;
inputting the preprocessed image to be estimated into a self-supervision depth estimation network, performing depth estimation, and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image (semantically rich, on high-level features) so as to overcome the defect of CNN-based methods; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image (spatially accurate, with fine-grained details on low-level features), thereby preventing the low-level features from being washed out by the Transformer-based network; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
Further, the feature map obtained by encoding enters the decoding layers and is up-sampled layer by layer, and the depth image is predicted using image re-projection as the supervision signal.
The overall idea of this embodiment is as follows: depth estimation is performed on the input image using the constructed self-supervised depth estimation network. The general framework of the self-supervised depth estimation network is based on an encoder-decoder architecture, with skip connections designed between the encoder and the decoder. The preprocessed picture is first input into a Transformer network to extract the global features of the scene. Specifically, the input image is first divided by a patch partition operator to obtain a plurality of image blocks of the same size, each image block serving as a token; each stage then comprises a Patch Merging layer and multiple Transformer blocks, i.e., multiple Transformer blocks are connected in sequence after the Patch Merging layer, where Patch Merging halves the patch resolution and doubles the number of channels, and each Transformer block comprises a multi-head self-attention module and a multi-layer perceptron module. Transformer-based methods are characterized by interactions between tokens, and local features are often ignored during these interactions, so a convolution branch is introduced into the framework to supplement the local features. In the convolution branch, since CNN-based models generally extract local information mainly at the low-level feature layers, only the first convolution layer is used to accurately capture spatial and local context information, and the output feature map of the convolution branch is connected by a skip connection to the penultimate feature layer in the decoding section.
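To make the two-branch layout concrete, the following is a minimal PyTorch sketch of the structure described above. The class name (DualBranchDepthNet), the channel widths, and the plain convolutional stages standing in for the Swin-style Transformer encoder are illustrative assumptions, not the patent's implementation; only the overall wiring (a four-stage encoder with skip connections, a single-layer convolution branch, and fusion of the two branches before the last decoding layer) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class DualBranchDepthNet(nn.Module):
    """Skeleton of the two-branch encoder-decoder: a (stand-in) Transformer branch
    with skip connections plus a single-layer convolution branch whose features are
    fused before the last decoding layer."""

    def __init__(self, chs=(64, 128, 256, 512)):
        super().__init__()
        # Stand-in for the Transformer-branch encoder (patch partition + Swin-style stages).
        self.enc = nn.ModuleList()
        in_ch = 3
        for c in chs:
            self.enc.append(conv_block(in_ch, c, stride=2))    # H/2, H/4, H/8, H/16
            in_ch = c
        # Decoder stages with skip connections from the matching encoder stages.
        self.dec3 = conv_block(chs[3] + chs[2], chs[2])        # -> H/8
        self.dec2 = conv_block(chs[2] + chs[1], chs[1])        # -> H/4
        self.dec1 = conv_block(chs[1] + chs[0], chs[0])        # -> H/2 (penultimate layer)
        # Convolution branch: only the first convolution layer, kept at H/2 resolution.
        self.conv_branch = conv_block(3, chs[0], stride=2)
        # Last decoding layer: fuses the two branches and predicts a disparity map.
        self.dec0 = nn.Sequential(
            conv_block(chs[0] * 2, chs[0] // 2),
            nn.Conv2d(chs[0] // 2, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        def up(t):
            return F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)

        feats, h = [], x
        for stage in self.enc:
            h = stage(h)
            feats.append(h)
        d3 = self.dec3(torch.cat([up(feats[3]), feats[2]], dim=1))   # H/8
        d2 = self.dec2(torch.cat([up(d3), feats[1]], dim=1))         # H/4
        d1 = self.dec1(torch.cat([up(d2), feats[0]], dim=1))         # H/2
        local = self.conv_branch(x)                                  # H/2, local detail
        return self.dec0(up(torch.cat([d1, local], dim=1)))         # H x W disparity in (0, 1)


# Example: disp = DualBranchDepthNet()(torch.randn(1, 3, 192, 640))  # -> (1, 1, 192, 640)
```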
By combining CNN and Transformer for the monocular depth estimation task, the estimation accuracy is superior to existing algorithms on the same public data sets; the local features are supplemented using the convolution branch, and only the first convolution layer is used, so that the local features are enhanced without increasing computational complexity, the accuracy of depth estimation is improved, and the estimation accuracy of the algorithm is further improved. Compared with existing self-supervised monocular depth estimation methods, the scheme of this embodiment improves estimation accuracy.
At present, self-supervised monocular depth estimation predicts the depth map mainly by taking image re-projection across different views as the supervision signal. Specifically, N pairs of rectified stereo images $\{(I_l^i, I_r^i)\}_{i=1}^{N}$ are taken as training data, where $I_l^i$ and $I_r^i$ denote the left and right images of the i-th stereo pair, $i \in \{1,\dots,N\}$, and N is the number of training pairs. On this basis, an image re-projection loss is established between the input image and the synthesized image, with the formula:

$$L_{rep} = \sum_{i=1}^{N} \left\| I_l^i - \tilde{I}_l^i \right\|, \qquad \tilde{I}_l^i = f_\omega\!\left(I_r^i, d_l^i\right)$$

where $\tilde{I}_l^i$ is the synthesized image, $d_l^i$ is the disparity map between the left and right images, and $f_\omega(\cdot)$ is the warp function.
The network is trained based on this loss function, and then the left image or the right image is input into the network for verification, and a depth map is output. Taking the input left image $I_l \in \mathbb{R}^{H \times W \times 3}$ as an example, the encoding layers output a feature map, which is up-sampled layer by layer by the decoding layers with image re-projection as the supervision signal, and the corresponding disparity map $d_l$ is estimated by the network; combining the provided camera parameters, namely the baseline $b$ and focal length $f$, the final depth map is obtained as $D_l = bf / d_l$.
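As a concrete illustration of this training signal, the sketch below warps the right image with the predicted left-view disparity to synthesize the left image (the warp $f_\omega$), measures an L1 re-projection error, and converts disparity to depth with $D_l = bf/d_l$. The function names, the plain L1 form of the loss (the patent's full loss may also contain SSIM or smoothness terms not shown here), and the sign convention of the horizontal shift are assumptions.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(img_r, disp_l):
    """Synthesize the left view by horizontally resampling the right image with the
    left disparity map. disp_l is in pixels, shape (B, 1, H, W)."""
    b, _, h, w = img_r.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img_r.device, dtype=img_r.dtype),
        torch.arange(w, device=img_r.device, dtype=img_r.dtype),
        indexing="ij",
    )
    # Shift each pixel horizontally by its disparity (sign depends on the rectification convention).
    x_src = xs.unsqueeze(0) - disp_l.squeeze(1)                        # (B, H, W)
    y_src = ys.unsqueeze(0).expand_as(x_src)
    grid = torch.stack(                                                # normalized to [-1, 1]
        [2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0], dim=-1
    )
    return F.grid_sample(img_r, grid, mode="bilinear", padding_mode="border", align_corners=True)


def reprojection_loss(img_l, img_r, disp_l):
    """L1 image re-projection loss between the input left image and the synthesized one."""
    return (img_l - warp_right_to_left(img_r, disp_l)).abs().mean()


def disparity_to_depth(disp_l, baseline, focal):
    """D_l = b * f / d_l, guarding against zero disparity."""
    return baseline * focal / disp_l.clamp(min=1e-6)
```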
Based on the image re-projection loss, this embodiment presents a self-supervised depth estimation framework consisting of a Transformer branch that learns global information and a convolution branch that learns local information, as shown in fig. 3. Further, an additional rectangular convolution module is added in the convolution branch to learn semi-global information; in order to solve the problem of insufficient edge accuracy in depth estimation, a shape refinement module is added at the end of the Transformer and convolution branches to enhance the estimation of object details.
The general framework of the self-supervised depth estimation network is based on an encoder-decoder architecture, with skip connections designed between the encoder and the decoder. CNN-based methods can express local contexts well, but are often insufficient to extract semantically rich global contexts due to their small receptive fields and large local inductive bias. In contrast, Transformer-based methods typically exhibit excellent global context modeling, but their potential bottleneck is the lack of representation of detail and spatial locality, mainly because Transformer-based methods are characterized by interactions between tokens, and local features are typically ignored during these interactions; in addition, since depth values generally follow a long-tailed distribution and natural scenes contain many short-distance objects with small depth values, Transformer-based methods do not estimate such objects well. Therefore, the key idea of the self-supervised learning here is to accurately estimate the depth map from a single image by combining the advantages of CNN and Transformer.
The self-supervised depth estimation network framework proposed in this embodiment mainly consists of two branches. As shown in fig. 4, the Transformer branch has 4 stages in the coding section. First, the input image is divided by a patch partition operator to obtain a plurality of image blocks of the same size, each image block serving as a token; then, each stage has a Patch Merging layer and multiple Transformer blocks, each block including an MSA (multi-head self-attention) module, an MLP (multi-layer perceptron) module and two normalization layers. Let the output feature after the (l-1)-th Transformer block be $z_{l-1}$; the features after the l-th Transformer block are then expressed as:

$$\hat{z}_l = \mathrm{MSA}\left(\mathrm{LN}\left(z_{l-1}\right)\right) + z_{l-1}$$
$$z_l = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}_l\right)\right) + \hat{z}_l$$

where LN is layer normalization. Considering that rich spatial information is important for depth estimation, the convolution with stride 4 in the original framework is changed to a convolution with stride 2, and the resolution of the feature map obtained at the i-th Transformer stage is $\frac{H}{2^{i}} \times \frac{W}{2^{i}}$, where H and W are the height and width of the original input image.
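A minimal PyTorch sketch of one such block follows; it applies the two formulas above to a token sequence. For simplicity it uses PyTorch's global multi-head attention rather than the windowed (Swin-style) attention and patch merging the branch actually uses, so treat it as illustrative only; the head count and MLP ratio are assumptions.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """One block of the Transformer branch: LN -> multi-head self-attention -> LN -> MLP,
    each with a residual connection, acting on a token sequence of shape (B, N, C)."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z_hat_l = MSA(LN(z_{l-1})) + z_{l-1}
        return z + self.mlp(self.norm2(z))                  # z_l = MLP(LN(z_hat_l)) + z_hat_l


# Example: out = TransformerBlock(96)(torch.randn(2, 4800, 96))
```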
In the convolution branch, since CNN-based models generally extract local information mainly at the low-level feature layers, only the first convolution layer is used to accurately capture spatial and local context information, and the output feature map of the convolution branch is connected by a skip connection to the penultimate feature layer in the decoding section. By setting up the convolution branch, the Transformer branch can be prevented from discarding critical local information.
Further, the convolution branch also comprises a rectangular convolution module with a pyramid structure. Based on the features obtained by the convolution branch, semi-global information of the image is obtained, improving the accuracy of the convolution branch's depth estimation.
Unlike conventional self-attention, the rectangular convolution pyramid structure used in this embodiment is simpler and more effective; each convolution takes the form of a strip convolution, which helps segment strip-like objects in the scene, such as pedestrians and trees. This simply-structured spatial attention handles spatial information better than standard convolution and self-attention. As shown in fig. 5, the proposed pyramid-structured rectangular convolution module comprises three stages: local information is first aggregated using a 5×5 depth-wise convolution, global context information is then extracted using depth separable convolutions, and the information of each channel is then aggregated using a 1×1 convolution. The output of the rectangular convolution module is directly used as the weights of an attention mechanism and is weighted with the module's input features to obtain the final output, with the formula:
$$\mathrm{Att} = \mathrm{Conv}_{1\times 1}\left(\sum_{i=0}^{3} \mathrm{Scale}_i\left(\mathrm{DW\text{-}Conv}\left(G_0\right)\right)\right)$$
$$\mathrm{Out} = \mathrm{Att} \otimes G_0$$

where $G_0$ denotes the first-layer features of the input ResNet-50, $\otimes$ is the element-wise matrix multiplication operation, DW-Conv denotes depth-wise convolution, $\mathrm{Scale}_i$ ($i \in \{0,1,2,3\}$) denotes the different branches, and $\mathrm{Scale}_0$ denotes the identity connection. In the framework, the pyramid-structured rectangular convolution module is placed after the convolution branch of the encoder, so the features obtained by the convolution branch also acquire a certain global character; the segmentation of strip-like objects is enhanced by the strip convolutions in the pyramid-structured rectangular convolution module, and the convolution features produced by the module can be fused better with the features of the Transformer branch.
Further, the Transformer does not process pixels directly, because its characteristic is to analyze the correlations between token blocks. In practical problems, objects in the scene mostly have irregular shapes, which leads to inaccurate estimation of edges and small objects by the Transformer. However, conventional dense prediction methods have a large computational cost, making it difficult to exploit the advantages of the Transformer. Therefore, in order to alleviate the problem that the Transformer is not good at processing object edges, this embodiment provides a shape refinement module that obtains an accurate scene geometry by learning the affinity matrix between adjacent pixels, reducing the amount of computation while ensuring estimation accuracy.
The shape refinement module proposed in this embodiment, as shown in fig. 6, includes a depth separable convolution, a convolution layer and a multi-layer perceptron module connected in sequence. It regards an image as a combination of block-level learnable regions (a group of interrelated regions), where each block has a flexible shape and unified semantics; it abandons the common Cartesian feature layout, rather than simply dividing the image into blocks according to a grid (which is not accurate enough, since a grid of blocks cannot describe edge shapes well). Moreover, the module operates only at the region level, which greatly improves the efficiency of the model while maintaining its accuracy.
Specifically, this embodiment establishes a pixel-token association to describe the region geometry: the geometry of a region is obtained by associating pixels with the tokens of the surrounding regions. Starting from an initial grid of size h×w, where h×w=n, each token is located on a single grid cell and serves as the "base point" of its corresponding region S; the grid itself is simply a token position indicator and is independent of the actual region geometry. Each pixel p is assigned to a region S with probability $q_S(p)$, and the pixel-token association is constructed not over the global image but only for tokens satisfying the following adjacency condition:

$$S \in N_p$$
where $N_p$ denotes the set of regions adjacent to p. In this embodiment, the surrounding area is set to 3×3, so $|N_p| = 9$, which performs well in all models. As shown in fig. 7, the pixel p is assigned to one of the 9 surrounding regions, and the size of the partition map finally output by the module is $W_h \times W_w$, where h and w are the initially set grid sizes, with the grid size set to 4×4 by default in the network. As shown in fig. 6, the module includes a lightweight affinity determination module consisting of a depth separable convolution and a 1D convolution for fusing local information, generating a feature map of size h×w×(9hw), which is finally restored to the original image size. Finally, the obtained affinity matrix is associated with the pixel depth pixel by pixel; for each pixel p, the depth is computed as:
$$F_p = \sum_{S \in N_p} q_S(p)\, f_S$$

where $F_p$ is the output depth map value at pixel p and $f_S$ is the pixel-level depth of the corresponding region S.
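The sketch below illustrates this refinement step under the stated 3×3 neighbourhood and a default 4×4 grid: region-level depths are pooled from the coarse prediction, each pixel receives softmax-normalized affinities $q_S(p)$ over its 9 adjacent regions, and the refined depth is the affinity-weighted sum $F_p = \sum_S q_S(p)\, f_S$. The affinity head (a depth-wise convolution followed by a 1×1 convolution) and the use of average pooling for the region depths are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShapeRefinement(nn.Module):
    """Refine a coarse depth map by predicting, for every pixel, affinities q_S(p) over the
    3x3 neighbourhood of region tokens (grid cells) and re-weighting the region-level depths."""

    def __init__(self, feat_ch, grid=4):
        super().__init__()
        self.grid = grid
        # Lightweight affinity head: depth-wise conv + 1x1 conv -> 9 logits per pixel.
        self.affinity = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1, groups=feat_ch),
            nn.Conv2d(feat_ch, 9, 1),
        )

    def forward(self, feat, coarse_depth):
        b, _, h, w = coarse_depth.shape
        g = self.grid
        region_depth = F.avg_pool2d(coarse_depth, g)                  # one depth per grid cell
        neigh = F.unfold(region_depth, kernel_size=3, padding=1)      # (B, 9, (H/g)*(W/g))
        neigh = neigh.view(b, 9, h // g, w // g)
        neigh = F.interpolate(neigh, size=(h, w), mode="nearest")     # broadcast regions to pixels
        q = torch.softmax(self.affinity(feat), dim=1)                 # q_S(p) over the 9 regions
        return (q * neigh).sum(dim=1, keepdim=True)                   # F_p = sum_S q_S(p) * f_S


# Example: refined = ShapeRefinement(64)(torch.randn(1, 64, 192, 640), torch.rand(1, 1, 192, 640))
```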
The self-supervised monocular depth estimation method provided by this embodiment fuses the advantage of the CNN framework in retaining local context information with the advantage of the Transformer framework in extracting global context information, extracts the complete context information of the image scene in a self-supervised manner, and achieves self-supervised depth estimation with better effect; at the same time, semi-global feature information is perceived along the horizontal and vertical directions through the pyramid-structured rectangular convolution module, so that complete context information is obtained, and more accurate estimation of object boundaries is achieved through the shape refinement module.
Example 2
The embodiment provides a self-supervision monocular depth estimation system, which comprises:
the image to be estimated acquisition module is used for acquiring an image to be estimated and preprocessing the image to be estimated;
the depth estimation module is used for inputting the preprocessed image to be estimated into the self-supervision depth estimation network, carrying out depth estimation and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
Example 3
The present embodiment provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps in the self-supervising monocular depth estimation method as described above.
Example 4
The present embodiment also provides a computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps in a self-supervising monocular depth estimation method as described above.
The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description of the second embodiment refers to the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media including one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one of the methods of the present invention.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. The self-supervision monocular depth estimation method is characterized by comprising the following steps of:
acquiring an image to be estimated, and preprocessing the image to be estimated;
inputting the preprocessed image to be estimated into a self-supervision depth estimation network, performing depth estimation, and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
2. The method of self-supervised monocular depth estimation according to claim 1, wherein in the Transformer branch each coding layer comprises a plurality of Transformer blocks, each Transformer block comprising a first normalization layer, a multi-head self-attention module, a second normalization layer, and a multi-layer perceptron module connected in sequence.
3. The method of self-supervising monocular depth estimation according to claim 1, wherein the rectangular convolution module is a pyramid structure comprising a 5 x 5 convolution, a depth separable convolution, and a 1 x 1 convolution, each convolution taking the form of a strip convolution;
in the convolution branches, local characteristics output by a convolution coding layer are input into a rectangular convolution module, local characteristic information is aggregated through 5×5 convolution, global context information is respectively extracted through depth separable convolution comprising different convolution channels, information extracted by each convolution channel and the aggregated local characteristic information are aggregated through 1×1 convolution, and final aggregated output is taken as attention weight and weighted with the input local characteristics to obtain final output.
4. The self-supervising monocular depth estimation method of claim 1, wherein the preprocessing comprises:
the input image is divided into a plurality of image blocks of uniform size.
5. The self-supervising monocular depth estimation method of claim 1, wherein the self-supervising depth estimation network further comprises a shape refinement module comprising a depth separable convolution, convolution layer and multi-layer perceptron module connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
6. A self-supervising monocular depth estimation system, comprising: the image to be estimated acquisition module is used for acquiring an image to be estimated and preprocessing the image to be estimated;
the depth estimation module is used for inputting the preprocessed image to be estimated into the self-supervision depth estimation network, carrying out depth estimation and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
7. The self-supervising monocular depth estimation system of claim 6, wherein the rectangular convolution module is a pyramid structure comprising a 5 x 5 convolution, a depth separable convolution, and a 1 x 1 convolution, each convolution taking the form of a strip convolution;
in the convolution branches, local characteristics output by a convolution coding layer are input into a rectangular convolution module, local characteristic information is aggregated through 5×5 convolution, global context information is respectively extracted through depth separable convolution comprising different convolution channels, information extracted by each convolution channel and the aggregated local characteristic information are aggregated through 1×1 convolution, and final aggregated output is taken as attention weight and weighted with the input local characteristics to obtain final output.
8. The self-supervising monocular depth estimation system of claim 6, wherein the self-supervising depth estimation network further comprises a shape refinement module comprising a depth separable convolution, convolution layer, and multi-layer perceptron module connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
9. An electronic device, characterized by: comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which, when executed by the processor, perform the steps of a self-supervising monocular depth estimation method according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized by: for storing computer instructions which, when executed by a processor, perform the steps of a self-supervising monocular depth estimation method according to any one of claims 1 to 5.
CN202310176306.1A 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system Active CN116258756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310176306.1A CN116258756B (en) 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310176306.1A CN116258756B (en) 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system

Publications (2)

Publication Number Publication Date
CN116258756A true CN116258756A (en) 2023-06-13
CN116258756B CN116258756B (en) 2024-03-08

Family

ID=86683978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310176306.1A Active CN116258756B (en) 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system

Country Status (1)

Country Link
CN (1) CN116258756B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437272A (en) * 2023-12-21 2024-01-23 齐鲁工业大学(山东省科学院) Monocular depth estimation method and system based on adaptive token aggregation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190213481A1 (en) * 2016-09-12 2019-07-11 Niantic, Inc. Predicting depth from image data using a statistical model
CN114066959A (en) * 2021-11-25 2022-02-18 天津工业大学 Single-stripe image depth estimation method based on Transformer
CN114897136A (en) * 2022-04-29 2022-08-12 清华大学 Multi-scale attention mechanism method and module and image processing method and device
CN115035171A (en) * 2022-05-31 2022-09-09 西北工业大学 Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115035172A (en) * 2022-06-08 2022-09-09 山东大学 Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN115423927A (en) * 2022-07-26 2022-12-02 复旦大学 ViT-based multi-view 3D reconstruction method and system
CN115620023A (en) * 2022-09-28 2023-01-17 广州大学 Real-time monocular depth estimation method fusing global features

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190213481A1 (en) * 2016-09-12 2019-07-11 Niantic, Inc. Predicting depth from image data using a statistical model
CN114066959A (en) * 2021-11-25 2022-02-18 天津工业大学 Single-stripe image depth estimation method based on Transformer
CN114897136A (en) * 2022-04-29 2022-08-12 清华大学 Multi-scale attention mechanism method and module and image processing method and device
CN115035171A (en) * 2022-05-31 2022-09-09 西北工业大学 Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115035172A (en) * 2022-06-08 2022-09-09 山东大学 Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN115423927A (en) * 2022-07-26 2022-12-02 复旦大学 ViT-based multi-view 3D reconstruction method and system
CN115620023A (en) * 2022-09-28 2023-01-17 广州大学 Real-time monocular depth estimation method fusing global features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIYUAN ZHANG ET AL.: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV 2022: COMPUTER VISION, 13 November 2022 (2022-11-13), pages 34 - 52 *
ZHENYU LI ET AL.: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", MACHINE INTELLIGENCE RESEARCH, vol. 20, no. 06, pages 837 - 854 *
张明亮: "High-quality monocular scene depth inference for heterogeneous camera data (面向异源相机数据的高质量单目场景深度推断)", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 01, pages 138-87 *
杨依凡: "Research on monocular depth estimation algorithms based on deep learning (基于深度学习的单目深度估计算法研究)", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 01, 15 January 2023 (2023-01-15), pages 138-67 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437272A (en) * 2023-12-21 2024-01-23 齐鲁工业大学(山东省科学院) Monocular depth estimation method and system based on adaptive token aggregation
CN117437272B (en) * 2023-12-21 2024-03-08 齐鲁工业大学(山东省科学院) Monocular depth estimation method and system based on adaptive token aggregation

Also Published As

Publication number Publication date
CN116258756B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
EP4137991A1 (en) Pedestrian re-identification method and device
CN109003297A (en) A kind of monocular depth estimation method, device, terminal and storage medium
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN116229461A (en) Indoor scene image real-time semantic segmentation method based on multi-scale refinement
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
CN113592726A (en) High dynamic range imaging method, device, electronic equipment and storage medium
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN116258756B (en) Self-supervision monocular depth estimation method and system
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN115131281A (en) Method, device and equipment for training change detection model and detecting image change
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN116342675B (en) Real-time monocular depth estimation method, system, electronic equipment and storage medium
CN110827341A (en) Picture depth estimation method and device and storage medium
CN112396657A (en) Neural network-based depth pose estimation method and device and terminal equipment
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
CN116309050A (en) Image super-resolution method, program product, storage medium and electronic device
CN115861601A (en) Multi-sensor fusion sensing method and device
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN114820755A (en) Depth map estimation method and system
CN113920317A (en) Semantic segmentation method based on visible light image and low-resolution depth image
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device
CN116152233B (en) Image processing method, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant