CN116258756A - Self-supervision monocular depth estimation method and system - Google Patents

Self-supervision monocular depth estimation method and system

Info

Publication number
CN116258756A
CN116258756A (Application CN202310176306.1A)
Authority
CN
China
Prior art keywords
convolution
image
depth
self
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310176306.1A
Other languages
Chinese (zh)
Other versions
CN116258756B (en)
Inventor
张明亮
周大正
李彬
智昱旻
刘丽霞
张友梅
张瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202310176306.1A priority Critical patent/CN116258756B/en
Publication of CN116258756A publication Critical patent/CN116258756A/en
Application granted granted Critical
Publication of CN116258756B publication Critical patent/CN116258756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20021 - Dividing image into blocks, subimages or windows
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The invention discloses a self-supervision monocular depth estimation method and system. The method comprises: obtaining an image to be estimated, and inputting the preprocessed image to be estimated into a self-supervision depth estimation network for depth estimation. The self-supervision depth estimation network comprises a Transformer branch and a convolution branch; the Transformer branch adopts an encoder-decoder structure with skip connections and is used for capturing global context information of the image; the convolution branch consists of a convolution coding layer and a rectangular convolution module with a pyramid structure and is used for extracting local context information of the image. The output features of the convolution branch are spliced with the output features of the last decoding layer, and a depth image is then output through the last decoding layer. The depth image is input into a shape refinement module, which learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with the pixel depths pixel by pixel, and outputs the final depth image, realizing more accurate depth estimation.

Description

Self-supervision monocular depth estimation method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a self-supervision monocular depth estimation method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Monocular depth estimation assigns a depth value to each pixel of a single input image and is widely used in computer vision and augmented reality applications such as automatic driving and 3D reconstruction. Currently, monocular depth estimation based on deep learning is generally divided into two research directions: supervised monocular depth estimation and self-supervised monocular depth estimation. While existing supervised methods can achieve good performance in monocular depth estimation, they typically require large amounts of diverse ground-truth depth labels for training, which are often expensive to acquire and have drawbacks; in outdoor scenarios in particular, the raw depth labels acquired, for example, with LiDAR (Light Detection and Ranging) are typically sparse points that do not align well with the original image, which limits their practical use to some extent. In contrast, self-supervised learning approaches can estimate the depth map through image re-projection across different views, without relying on any ground-truth depth labels during training. Therefore, self-supervised depth estimation has a wider field of application and a relatively lower learning cost.
At present, self-supervised depth estimation methods mainly follow two ideas: performing depth estimation of the target image with a CNN (Convolutional Neural Network) framework, or with a Transformer framework. The methods or algorithms proposed in prior studies generally either cannot model global correlations within a limited receptive field, or lack spatial perception bias when modeling local information, so existing self-supervised depth estimation performs poorly in visual tasks. Specifically:
(1) CNN-based methods can extract local context information well, but, owing to their small receptive fields and large local inductive bias, are usually insufficient to extract semantically rich global context information. Although the performance of CNN-based self-supervised depth estimation has gradually improved, the fundamental dilemma remains: global correlations cannot be modeled;
(2) Transformer-based methods generally extract global context information well for context modeling, but their potential bottleneck is the lack of representation of detail and spatial locality, because Transformer-based methods are characterized by interactions between tokens, and local features are often ignored during these interactions. Meanwhile, since depth values generally follow a long-tailed distribution, there are many short-distance objects with small depth values in natural scenes, and Transformer-based methods cannot estimate such short-distance objects well.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a self-supervision monocular depth estimation method and system. The method combines the advantage of the CNN framework in retaining local context information with the advantage of the Transformer framework in extracting global context information, extracts the complete context information of an image scene in a self-supervised manner, and realizes self-supervised depth estimation with better effect. At the same time, a shape refinement module is constructed to estimate object boundaries more accurately, and a rectangular convolution module with a pyramid structure is constructed to perceive semi-global feature information along the horizontal and vertical directions.
In a first aspect, the present disclosure provides a method of self-supervising monocular depth estimation.
A self-supervising monocular depth estimation method, comprising:
acquiring an image to be estimated, and preprocessing the image to be estimated;
inputting the preprocessed image to be estimated into a self-supervision depth estimation network, performing depth estimation, and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
In a further technical scheme, in the Transformer branch, each coding layer comprises a plurality of Transformer blocks, and each Transformer block comprises a first normalization layer, a multi-head self-attention module, a second normalization layer and a multi-layer perceptron module which are connected in sequence.
According to a further technical scheme, the rectangular convolution module has a pyramid structure and comprises a 5×5 convolution, depth separable convolutions and a 1×1 convolution, each convolution adopting the form of a strip convolution;
in the convolution branch, the local features output by the convolution coding layer are input into the rectangular convolution module; local feature information is aggregated through the 5×5 convolution, global context information is extracted through the depth separable convolutions comprising different convolution channels, the information extracted by each convolution channel and the aggregated local feature information are aggregated through the 1×1 convolution, and the final aggregated output is taken as attention weights and weighted with the input local features to obtain the final output.
According to a further technical scheme, the preprocessing comprises the following step:
the input image is divided into a plurality of image blocks of uniform size.
According to a further technical scheme, the self-supervision depth estimation network further comprises a shape refinement module, wherein the shape refinement module comprises a depth separable convolution, a convolution layer and a multi-layer perceptron module which are connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
In a second aspect, the present disclosure provides a self-supervising monocular depth estimation system.
A self-supervising monocular depth estimation system, comprising:
the image to be estimated acquisition module is used for acquiring an image to be estimated and preprocessing the image to be estimated;
the depth estimation module is used for inputting the preprocessed image to be estimated into the self-supervision depth estimation network, carrying out depth estimation and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
According to a further technical scheme, the self-supervision depth estimation network further comprises a shape refinement module, wherein the shape refinement module comprises a depth separable convolution, a convolution layer and a multi-layer perceptron module which are connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
The one or more of the above technical solutions have the following beneficial effects:
1. The invention provides a self-supervision monocular depth estimation method and system, which combine the advantage of the CNN framework in retaining local context information with the advantage of the Transformer framework in extracting global context information, extract the complete context information of an image scene in a self-supervised manner, and realize self-supervised depth estimation with better effect, while avoiding the defects that the CNN framework cannot model global correlations within a limited receptive field and that the Transformer framework generally lacks spatial perception bias during modeling.
2. Considering that strip-shaped objects in an image scene have strong correlations, the invention constructs a pyramid-structured rectangular convolution module, uses several strip convolutions of different scales to extract semi-global information in the scene, and obtains more complete context information by perceiving semi-global feature information along the horizontal and vertical directions.
3. Aiming at the problem of refining object edges in the scene, the invention obtains an accurate scene geometry by constructing a shape refinement module and learning the affinity matrix between adjacent pixels, enhancing the estimation accuracy of object edges and details without affecting model complexity, and further improving the accuracy of model estimation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a self-supervised monocular depth estimation method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a self-supervised monocular depth estimation method according to an embodiment of the present invention;
FIG. 3 is an algorithm flow chart of a method for self-supervising monocular depth estimation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a Transformer block in a self-supervised depth estimation network according to an embodiment of the present invention;
FIG. 5 is a block diagram of a rectangular convolution module in a self-supervised depth estimation network according to an embodiment of the present invention;
FIG. 6 is a block diagram of a shape refinement module in a self-supervising depth estimation network according to an embodiment of the present invention;
fig. 7 is an algorithm schematic diagram of a shape refinement module in a self-supervised depth estimation network according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In view of the problems of existing self-supervised monocular depth estimation described in the background art, this embodiment provides a self-supervision monocular depth estimation method, which improves the accuracy of depth estimation, further improves the accuracy of algorithm estimation, and achieves a better depth estimation effect. As shown in fig. 1 and 2, the method comprises the following steps:
acquiring an image to be estimated, and preprocessing the image to be estimated;
inputting the preprocessed image to be estimated into a self-supervision depth estimation network, performing depth estimation, and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image (semantically rich, on high-level features) so as to overcome the defect of CNN-based methods; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image (spatially accurate, with fine-grained details on low-level features), thereby preventing the low-level features from being washed out by the Transformer-based network; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
Further, the feature map obtained by encoding enters the decoding layers and is up-sampled layer by layer, and the depth image is predicted using image re-projection as the supervision signal.
The overall idea of this embodiment is as follows: depth estimation is performed on the input image using the constructed self-supervised depth estimation network. The general framework of the self-supervised depth estimation network is based on an encoder-decoder architecture, with skip connections designed between the encoder and the decoder. The preprocessed picture is first input into a Transformer network to extract the global features of the scene. Specifically, the input image is first divided by a patch partition operator to obtain a plurality of image blocks of the same size, each image block serving as a token; each stage then comprises a Patch Merging layer and multiple Transformer blocks, i.e., multiple Transformer blocks are connected in sequence after the Patch Merging layer, where Patch Merging halves the patch resolution and doubles the number of channels, and each Transformer block comprises a multi-head self-attention module and a multi-layer perceptron module. Transformer-based methods are characterized by interactions between tokens, and local features are often ignored during these interactions, so a convolution branch is introduced into the framework to supplement the local features. In the convolution branch, since CNN-based models generally extract local information mainly at the low-level feature layers, only the first convolution layer is used to accurately capture spatial and local context information, and the output feature map of the convolution branch is connected by a skip connection to the penultimate feature layer in the decoding section.
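To make the two-branch layout concrete, the following is a minimal PyTorch sketch of the structure described above. The class name (DualBranchDepthNet), the channel widths, and the plain convolutional stages standing in for the Swin-style Transformer encoder are illustrative assumptions, not the patent's implementation; only the overall wiring (a four-stage encoder with skip connections, a single-layer convolution branch, and fusion of the two branches before the last decoding layer) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class DualBranchDepthNet(nn.Module):
    """Skeleton of the two-branch encoder-decoder: a (stand-in) Transformer branch
    with skip connections plus a single-layer convolution branch whose features are
    fused before the last decoding layer."""

    def __init__(self, chs=(64, 128, 256, 512)):
        super().__init__()
        # Stand-in for the Transformer-branch encoder (patch partition + Swin-style stages).
        self.enc = nn.ModuleList()
        in_ch = 3
        for c in chs:
            self.enc.append(conv_block(in_ch, c, stride=2))    # H/2, H/4, H/8, H/16
            in_ch = c
        # Decoder stages with skip connections from the matching encoder stages.
        self.dec3 = conv_block(chs[3] + chs[2], chs[2])        # -> H/8
        self.dec2 = conv_block(chs[2] + chs[1], chs[1])        # -> H/4
        self.dec1 = conv_block(chs[1] + chs[0], chs[0])        # -> H/2 (penultimate layer)
        # Convolution branch: only the first convolution layer, kept at H/2 resolution.
        self.conv_branch = conv_block(3, chs[0], stride=2)
        # Last decoding layer: fuses the two branches and predicts a disparity map.
        self.dec0 = nn.Sequential(
            conv_block(chs[0] * 2, chs[0] // 2),
            nn.Conv2d(chs[0] // 2, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        def up(t):
            return F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)

        feats, h = [], x
        for stage in self.enc:
            h = stage(h)
            feats.append(h)
        d3 = self.dec3(torch.cat([up(feats[3]), feats[2]], dim=1))   # H/8
        d2 = self.dec2(torch.cat([up(d3), feats[1]], dim=1))         # H/4
        d1 = self.dec1(torch.cat([up(d2), feats[0]], dim=1))         # H/2
        local = self.conv_branch(x)                                  # H/2, local detail
        return self.dec0(up(torch.cat([d1, local], dim=1)))         # H x W disparity in (0, 1)


# Example: disp = DualBranchDepthNet()(torch.randn(1, 3, 192, 640))  # -> (1, 1, 192, 640)
```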
By combining CNN and Transformer for the monocular depth estimation task, the estimation accuracy is superior to existing algorithms on the same public data sets; the local features are supplemented using the convolution branch, and only the first convolution layer is used, so that the local features are enhanced without increasing computational complexity, the accuracy of depth estimation is improved, and the estimation accuracy of the algorithm is further improved. Compared with existing self-supervised monocular depth estimation methods, the scheme of this embodiment improves estimation accuracy.
At present, self-supervised monocular depth estimation predicts the depth map mainly by taking image re-projection across different views as the supervision signal. Specifically, N pairs of rectified stereo images $\{(I_l^i, I_r^i)\}_{i=1}^{N}$ are taken as training data, where $I_l^i$ and $I_r^i$ denote the left and right images of the i-th stereo pair, $i \in \{1,\dots,N\}$, and N is the number of training pairs. On this basis, an image re-projection loss is established between the input image and the synthesized image, with the formula:

$$L_{rep} = \sum_{i=1}^{N} \left\| I_l^i - \tilde{I}_l^i \right\|, \qquad \tilde{I}_l^i = f_\omega\!\left(I_r^i, d_l^i\right)$$

where $\tilde{I}_l^i$ is the synthesized image, $d_l^i$ is the disparity map between the left and right images, and $f_\omega(\cdot)$ is the warp function.
The network is trained based on this loss function, and then the left image or the right image is input into the network for verification, and a depth map is output. Taking the input left image $I_l \in \mathbb{R}^{H \times W \times 3}$ as an example, the encoding layers output a feature map, which is up-sampled layer by layer by the decoding layers with image re-projection as the supervision signal, and the corresponding disparity map $d_l$ is estimated by the network; combining the provided camera parameters, namely the baseline $b$ and focal length $f$, the final depth map is obtained as $D_l = bf / d_l$.
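As a concrete illustration of this training signal, the sketch below warps the right image with the predicted left-view disparity to synthesize the left image (the warp $f_\omega$), measures an L1 re-projection error, and converts disparity to depth with $D_l = bf/d_l$. The function names, the plain L1 form of the loss (the patent's full loss may also contain SSIM or smoothness terms not shown here), and the sign convention of the horizontal shift are assumptions.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(img_r, disp_l):
    """Synthesize the left view by horizontally resampling the right image with the
    left disparity map. disp_l is in pixels, shape (B, 1, H, W)."""
    b, _, h, w = img_r.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img_r.device, dtype=img_r.dtype),
        torch.arange(w, device=img_r.device, dtype=img_r.dtype),
        indexing="ij",
    )
    # Shift each pixel horizontally by its disparity (sign depends on the rectification convention).
    x_src = xs.unsqueeze(0) - disp_l.squeeze(1)                        # (B, H, W)
    y_src = ys.unsqueeze(0).expand_as(x_src)
    grid = torch.stack(                                                # normalized to [-1, 1]
        [2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0], dim=-1
    )
    return F.grid_sample(img_r, grid, mode="bilinear", padding_mode="border", align_corners=True)


def reprojection_loss(img_l, img_r, disp_l):
    """L1 image re-projection loss between the input left image and the synthesized one."""
    return (img_l - warp_right_to_left(img_r, disp_l)).abs().mean()


def disparity_to_depth(disp_l, baseline, focal):
    """D_l = b * f / d_l, guarding against zero disparity."""
    return baseline * focal / disp_l.clamp(min=1e-6)
```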
Based on the image re-projection loss, this embodiment presents a self-supervised depth estimation framework consisting of a Transformer branch that learns global information and a convolution branch that learns local information, as shown in fig. 3. Further, an additional rectangular convolution module is added in the convolution branch to learn semi-global information; in order to solve the problem of insufficient edge accuracy in depth estimation, a shape refinement module is added at the end of the Transformer and convolution branches to enhance the estimation of object details.
The general framework of the self-supervised depth estimation network is based on an encoder-decoder architecture, with skip connections designed between the encoder and the decoder. CNN-based methods can express local contexts well, but are often insufficient to extract semantically rich global contexts due to their small receptive fields and large local inductive bias. In contrast, Transformer-based methods typically exhibit excellent global context modeling, but their potential bottleneck is the lack of representation of detail and spatial locality, mainly because Transformer-based methods are characterized by interactions between tokens, and local features are typically ignored during these interactions; in addition, since depth values generally follow a long-tailed distribution and natural scenes contain many short-distance objects with small depth values, Transformer-based methods do not estimate such objects well. Therefore, the key idea of the self-supervised learning here is to accurately estimate the depth map from a single image by combining the advantages of CNN and Transformer.
The self-supervised depth estimation network framework proposed in this embodiment mainly consists of two branches. As shown in fig. 4, the Transformer branch has 4 stages in the coding section. First, the input image is divided by a patch partition operator to obtain a plurality of image blocks of the same size, each image block serving as a token; then, each stage has a Patch Merging layer and multiple Transformer blocks, each block including an MSA (multi-head self-attention) module, an MLP (multi-layer perceptron) module and two normalization layers. Let the output feature after the (l-1)-th Transformer block be $z_{l-1}$; the features after the l-th Transformer block are then expressed as:

$$\hat{z}_l = \mathrm{MSA}\left(\mathrm{LN}\left(z_{l-1}\right)\right) + z_{l-1}$$
$$z_l = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}_l\right)\right) + \hat{z}_l$$

where LN is layer normalization. Considering that rich spatial information is important for depth estimation, the convolution with stride 4 in the original framework is changed to a convolution with stride 2, and the resolution of the feature map obtained at the i-th Transformer stage is $\frac{H}{2^{i}} \times \frac{W}{2^{i}}$, where H and W are the height and width of the original input image.
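A minimal PyTorch sketch of one such block follows; it applies the two formulas above to a token sequence. For simplicity it uses PyTorch's global multi-head attention rather than the windowed (Swin-style) attention and patch merging the branch actually uses, so treat it as illustrative only; the head count and MLP ratio are assumptions.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """One block of the Transformer branch: LN -> multi-head self-attention -> LN -> MLP,
    each with a residual connection, acting on a token sequence of shape (B, N, C)."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z_hat_l = MSA(LN(z_{l-1})) + z_{l-1}
        return z + self.mlp(self.norm2(z))                  # z_l = MLP(LN(z_hat_l)) + z_hat_l


# Example: out = TransformerBlock(96)(torch.randn(2, 4800, 96))
```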
In the convolution branch, since CNN-based models generally extract local information mainly at the low-level feature layers, only the first convolution layer is used to accurately capture spatial and local context information, and the output feature map of the convolution branch is connected by a skip connection to the penultimate feature layer in the decoding section. By setting up the convolution branch, the Transformer branch can be prevented from discarding critical local information.
Further, the convolution branch also comprises a rectangular convolution module with a pyramid structure. Based on the features obtained by the convolution branch, semi-global information of the image is obtained, improving the accuracy of the convolution branch's depth estimation.
Unlike conventional self-attention, the rectangular convolution pyramid structure used in this embodiment is simpler and more effective; each convolution takes the form of a strip convolution, which helps segment strip-like objects in the scene, such as pedestrians and trees. This simply-structured spatial attention handles spatial information better than standard convolution and self-attention. As shown in fig. 5, the proposed pyramid-structured rectangular convolution module comprises three stages: local information is first aggregated using a 5×5 depth-wise convolution, global context information is then extracted using depth separable convolutions, and the information of each channel is then aggregated using a 1×1 convolution. The output of the rectangular convolution module is directly used as the weights of an attention mechanism and is weighted with the module's input features to obtain the final output, with the formula:
$$\mathrm{Att} = \mathrm{Conv}_{1\times 1}\left(\sum_{i=0}^{3} \mathrm{Scale}_i\left(\mathrm{DW\text{-}Conv}\left(G_0\right)\right)\right)$$
$$\mathrm{Out} = \mathrm{Att} \otimes G_0$$

where $G_0$ denotes the first-layer features of the input ResNet-50, $\otimes$ is the element-wise matrix multiplication operation, DW-Conv denotes depth-wise convolution, $\mathrm{Scale}_i$ ($i \in \{0,1,2,3\}$) denotes the different branches, and $\mathrm{Scale}_0$ denotes the identity connection. In the framework, the pyramid-structured rectangular convolution module is placed after the convolution branch of the encoder, so the features obtained by the convolution branch also acquire a certain global character; the segmentation of strip-like objects is enhanced by the strip convolutions in the pyramid-structured rectangular convolution module, and the convolution features produced by the module can be fused better with the features of the Transformer branch.
Further, the Transformer does not process pixels directly, because its characteristic is to analyze the correlations between token blocks. In practical problems, objects in the scene mostly have irregular shapes, which leads to inaccurate estimation of edges and small objects by the Transformer. However, conventional dense prediction methods have a large computational cost, making it difficult to exploit the advantages of the Transformer. Therefore, in order to alleviate the problem that the Transformer is not good at processing object edges, this embodiment provides a shape refinement module that obtains an accurate scene geometry by learning the affinity matrix between adjacent pixels, reducing the amount of computation while ensuring estimation accuracy.
The shape refinement module proposed in this embodiment, as shown in fig. 6, includes a depth separable convolution, a convolution layer and a multi-layer perceptron module connected in sequence. It regards an image as a combination of block-level learnable regions (a group of interrelated regions), where each block has a flexible shape and unified semantics; it abandons the common Cartesian feature layout, rather than simply dividing the image into blocks according to a grid (which is not accurate enough, since a grid of blocks cannot describe edge shapes well). Moreover, the module operates only at the region level, which greatly improves the efficiency of the model while maintaining its accuracy.
Specifically, this embodiment establishes a pixel-token association to describe the region geometry: the geometry of a region is obtained by associating pixels with the tokens of the surrounding regions. Starting from an initial grid of size h×w, where h×w=n, each token is located on a single grid cell and serves as the "base point" of its corresponding region S; the grid itself is simply a token position indicator and is independent of the actual region geometry. Each pixel p is assigned to a region S with probability $q_S(p)$, and the pixel-token association is constructed not over the global image but only for tokens satisfying the following adjacency condition:

$$S \in N_p$$
where $N_p$ denotes the set of regions adjacent to p. In this embodiment, the surrounding area is set to 3×3, so $|N_p| = 9$, which performs well in all models. As shown in fig. 7, the pixel p is assigned to one of the 9 surrounding regions, and the size of the partition map finally output by the module is $W_h \times W_w$, where h and w are the initially set grid sizes, with the grid size set to 4×4 by default in the network. As shown in fig. 6, the module includes a lightweight affinity determination module consisting of a depth separable convolution and a 1D convolution for fusing local information, generating a feature map of size h×w×(9hw), which is finally restored to the original image size. Finally, the obtained affinity matrix is associated with the pixel depth pixel by pixel; for each pixel p, the depth is computed as:
$$F_p = \sum_{S \in N_p} q_S(p)\, f_S$$

where $F_p$ is the output depth map value at pixel p and $f_S$ is the pixel-level depth of the corresponding region S.
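The sketch below illustrates this refinement step under the stated 3×3 neighbourhood and a default 4×4 grid: region-level depths are pooled from the coarse prediction, each pixel receives softmax-normalized affinities $q_S(p)$ over its 9 adjacent regions, and the refined depth is the affinity-weighted sum $F_p = \sum_S q_S(p)\, f_S$. The affinity head (a depth-wise convolution followed by a 1×1 convolution) and the use of average pooling for the region depths are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShapeRefinement(nn.Module):
    """Refine a coarse depth map by predicting, for every pixel, affinities q_S(p) over the
    3x3 neighbourhood of region tokens (grid cells) and re-weighting the region-level depths."""

    def __init__(self, feat_ch, grid=4):
        super().__init__()
        self.grid = grid
        # Lightweight affinity head: depth-wise conv + 1x1 conv -> 9 logits per pixel.
        self.affinity = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1, groups=feat_ch),
            nn.Conv2d(feat_ch, 9, 1),
        )

    def forward(self, feat, coarse_depth):
        b, _, h, w = coarse_depth.shape
        g = self.grid
        region_depth = F.avg_pool2d(coarse_depth, g)                  # one depth per grid cell
        neigh = F.unfold(region_depth, kernel_size=3, padding=1)      # (B, 9, (H/g)*(W/g))
        neigh = neigh.view(b, 9, h // g, w // g)
        neigh = F.interpolate(neigh, size=(h, w), mode="nearest")     # broadcast regions to pixels
        q = torch.softmax(self.affinity(feat), dim=1)                 # q_S(p) over the 9 regions
        return (q * neigh).sum(dim=1, keepdim=True)                   # F_p = sum_S q_S(p) * f_S


# Example: refined = ShapeRefinement(64)(torch.randn(1, 64, 192, 640), torch.rand(1, 1, 192, 640))
```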
The self-supervised monocular depth estimation method provided by this embodiment fuses the advantage of the CNN framework in retaining local context information with the advantage of the Transformer framework in extracting global context information, extracts the complete context information of the image scene in a self-supervised manner, and achieves self-supervised depth estimation with better effect; at the same time, semi-global feature information is perceived along the horizontal and vertical directions through the pyramid-structured rectangular convolution module, so that complete context information is obtained, and more accurate estimation of object boundaries is achieved through the shape refinement module.
Example 2
The embodiment provides a self-supervision monocular depth estimation system, which comprises:
the image to be estimated acquisition module is used for acquiring an image to be estimated and preprocessing the image to be estimated;
the depth estimation module is used for inputting the preprocessed image to be estimated into the self-supervision depth estimation network, carrying out depth estimation and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
Example 3
The present embodiment provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps in the self-supervising monocular depth estimation method as described above.
Example 4
The present embodiment also provides a computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps in a self-supervising monocular depth estimation method as described above.
The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description of the second embodiment refers to the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media including one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one of the methods of the present invention.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. The self-supervision monocular depth estimation method is characterized by comprising the following steps of:
acquiring an image to be estimated, and preprocessing the image to be estimated;
inputting the preprocessed image to be estimated into a self-supervision depth estimation network, performing depth estimation, and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
2. The method of self-supervised monocular depth estimation according to claim 1, wherein in the Transformer branch each coding layer comprises a plurality of Transformer blocks, each Transformer block comprising a first normalization layer, a multi-head self-attention module, a second normalization layer, and a multi-layer perceptron module connected in sequence.
3. The method of self-supervising monocular depth estimation according to claim 1, wherein the rectangular convolution module is a pyramid structure comprising a 5 x 5 convolution, a depth separable convolution, and a 1 x 1 convolution, each convolution taking the form of a strip convolution;
in the convolution branches, local characteristics output by a convolution coding layer are input into a rectangular convolution module, local characteristic information is aggregated through 5×5 convolution, global context information is respectively extracted through depth separable convolution comprising different convolution channels, information extracted by each convolution channel and the aggregated local characteristic information are aggregated through 1×1 convolution, and final aggregated output is taken as attention weight and weighted with the input local characteristics to obtain final output.
4. The self-supervising monocular depth estimation method of claim 1, wherein the preprocessing comprises:
the input image is divided into a plurality of image blocks of uniform size.
5. The self-supervising monocular depth estimation method of claim 1, wherein the self-supervising depth estimation network further comprises a shape refinement module comprising a depth separable convolution, convolution layer and multi-layer perceptron module connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
6. A self-supervising monocular depth estimation system, comprising: the image to be estimated acquisition module is used for acquiring an image to be estimated and preprocessing the image to be estimated;
the depth estimation module is used for inputting the preprocessed image to be estimated into the self-supervision depth estimation network, carrying out depth estimation and outputting a depth image;
the self-supervising depth estimation network comprises a Transformer branch and a convolution branch, wherein the Transformer branch is an encoder-decoder structure adopting skip connections and is used for capturing global context information of the image; the convolution branch is a convolution coding layer followed by a rectangular convolution module and is used for extracting local context information of the image; the output features of the convolution branch are spliced with the output features of the last decoding layer in the Transformer branch, and a depth image is output through the last decoding layer.
7. The self-supervising monocular depth estimation system of claim 6, wherein the rectangular convolution module is a pyramid structure comprising a 5 x 5 convolution, a depth separable convolution, and a 1 x 1 convolution, each convolution taking the form of a strip convolution;
in the convolution branches, local characteristics output by a convolution coding layer are input into a rectangular convolution module, local characteristic information is aggregated through 5×5 convolution, global context information is respectively extracted through depth separable convolution comprising different convolution channels, information extracted by each convolution channel and the aggregated local characteristic information are aggregated through 1×1 convolution, and final aggregated output is taken as attention weight and weighted with the input local characteristics to obtain final output.
8. The self-supervising monocular depth estimation system of claim 6, wherein the self-supervising depth estimation network further comprises a shape refinement module comprising a depth separable convolution, convolution layer, and multi-layer perceptron module connected in sequence;
and inputting the depth image output by the decoding layer into a shape refinement module, wherein the shape refinement module learns an affinity matrix between adjacent pixels in the image, associates the learned affinity matrix with pixel depth pixel by pixel, and outputs a final depth image.
9. An electronic device, characterized by: comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which, when executed by the processor, perform the steps of a self-supervising monocular depth estimation method according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized by: for storing computer instructions which, when executed by a processor, perform the steps of a self-supervising monocular depth estimation method according to any one of claims 1 to 5.
CN202310176306.1A 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system Active CN116258756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310176306.1A CN116258756B (en) 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310176306.1A CN116258756B (en) 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system

Publications (2)

Publication Number Publication Date
CN116258756A true CN116258756A (en) 2023-06-13
CN116258756B CN116258756B (en) 2024-03-08

Family

ID=86683978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310176306.1A Active CN116258756B (en) 2023-02-23 2023-02-23 Self-supervision monocular depth estimation method and system

Country Status (1)

Country Link
CN (1) CN116258756B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437272A (en) * 2023-12-21 2024-01-23 齐鲁工业大学(山东省科学院) Monocular depth estimation method and system based on adaptive token aggregation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190213481A1 (en) * 2016-09-12 2019-07-11 Niantic, Inc. Predicting depth from image data using a statistical model
CN114066959A (en) * 2021-11-25 2022-02-18 天津工业大学 Single-stripe image depth estimation method based on Transformer
CN114897136A (en) * 2022-04-29 2022-08-12 清华大学 Multi-scale attention mechanism method and module and image processing method and device
CN115035171A (en) * 2022-05-31 2022-09-09 西北工业大学 Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115035172A (en) * 2022-06-08 2022-09-09 山东大学 Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN115423927A (en) * 2022-07-26 2022-12-02 复旦大学 ViT-based multi-view 3D reconstruction method and system
CN115620023A (en) * 2022-09-28 2023-01-17 广州大学 Real-time monocular depth estimation method fusing global features

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190213481A1 (en) * 2016-09-12 2019-07-11 Niantic, Inc. Predicting depth from image data using a statistical model
CN114066959A (en) * 2021-11-25 2022-02-18 天津工业大学 Single-stripe image depth estimation method based on Transformer
CN114897136A (en) * 2022-04-29 2022-08-12 清华大学 Multi-scale attention mechanism method and module and image processing method and device
CN115035171A (en) * 2022-05-31 2022-09-09 西北工业大学 Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115035172A (en) * 2022-06-08 2022-09-09 山东大学 Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN115423927A (en) * 2022-07-26 2022-12-02 复旦大学 ViT-based multi-view 3D reconstruction method and system
CN115620023A (en) * 2022-09-28 2023-01-17 广州大学 Real-time monocular depth estimation method fusing global features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIYUAN ZHANG ET AL.: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV 2022: COMPUTER VISION, 13 November 2022 (2022-11-13), pages 34 - 52 *
ZHENYU LI ET AL.: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", MACHINE INTELLIGENCE RESEARCH, vol. 20, no. 06, pages 837 - 854 *
张明亮: "High-quality monocular scene depth inference for heterogeneous camera data (面向异源相机数据的高质量单目场景深度推断)", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 01, pages 138-87 *
杨依凡: "Research on monocular depth estimation algorithms based on deep learning (基于深度学习的单目深度估计算法研究)", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 01, 15 January 2023 (2023-01-15), pages 138-67 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437272A (en) * 2023-12-21 2024-01-23 齐鲁工业大学(山东省科学院) Monocular depth estimation method and system based on adaptive token aggregation
CN117437272B (en) * 2023-12-21 2024-03-08 齐鲁工业大学(山东省科学院) Monocular depth estimation method and system based on adaptive token aggregation

Also Published As

Publication number Publication date
CN116258756B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
EP4137991A1 (en) Pedestrian re-identification method and device
CN109003297A (en) A kind of monocular depth estimation method, device, terminal and storage medium
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN116229461A (en) Indoor scene image real-time semantic segmentation method based on multi-scale refinement
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
CN113592726A (en) High dynamic range imaging method, device, electronic equipment and storage medium
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN116258756B (en) Self-supervision monocular depth estimation method and system
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN115131281A (en) Method, device and equipment for training change detection model and detecting image change
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN116342675B (en) Real-time monocular depth estimation method, system, electronic equipment and storage medium
CN110827341A (en) Picture depth estimation method and device and storage medium
CN112396657A (en) Neural network-based depth pose estimation method and device and terminal equipment
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
CN116309050A (en) Image super-resolution method, program product, storage medium and electronic device
CN115861601A (en) Multi-sensor fusion sensing method and device
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN114820755A (en) Depth map estimation method and system
CN113920317A (en) Semantic segmentation method based on visible light image and low-resolution depth image
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device
CN116152233B (en) Image processing method, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant