CN115035172A - Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement - Google Patents

Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement

Info

Publication number
CN115035172A
Authority
CN
China
Prior art keywords
depth
map
confidence
level
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210641764.3A
Other languages
Chinese (zh)
Other versions
CN115035172B (en)
Inventor
李帅
徐宏伟
高艳博
周华松
元辉
蔡珣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weihai Institute Of Industrial Technology Shandong University
Shandong University
Original Assignee
Weihai Institute Of Industrial Technology Shandong University
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weihai Institute Of Industrial Technology Shandong University, Shandong University filed Critical Weihai Institute Of Industrial Technology Shandong University
Priority to CN202210641764.3A priority Critical patent/CN115035172B/en
Publication of CN115035172A publication Critical patent/CN115035172A/en
Application granted granted Critical
Publication of CN115035172B publication Critical patent/CN115035172B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/50: Depth or shape recovery
    • G06T 3/4038: Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 7/11: Region-based segmentation
    • G06T 7/13: Edge detection
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06T 2200/32: Indexing scheme for image data processing or generation, in general, involving image mosaicing
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; Image merging
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to the technical field of scene depth estimation and provides a depth estimation method and system based on confidence-level grading and inter-level fusion enhancement. The depth estimation method comprises the following steps: classifying pixels according to the difficulty of depth estimation; according to the pixel classification, extracting, for each class, a depth map of the pixel region of each level and the confidence map corresponding to that pixel region; splicing the depth map and confidence map of each level and superposing them level by level to obtain an inter-level fused confidence map, and performing feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as inter-level feature enhancement information; and encoding the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, then decoding to obtain the depth map corresponding to the target frame. By grading, the easy and accurate region pixel depth values of the previous level assist the difficult and error-prone region pixel depth estimation of the next level, improving the accuracy and quality of the depth estimation.

Description

Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
Technical Field
The disclosure relates to the technical field of scene depth estimation, in particular to a depth estimation method and system based on confidence level grading and interstage fusion enhancement.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Compared with two-dimensional vision, three-dimensional visual perception can provide the depth information of a scene and has wide application prospects in numerous visual tasks such as automatic driving, three-dimensional reconstruction and augmented reality, so depth estimation based on monocular video has attracted wide attention. With the development of deep learning, many monocular-video depth estimation methods have appeared, including supervised learning methods that take a ground-truth depth map as the target and self-supervised learning methods that take the structure generated by video motion as the target.
The inventors have found that existing monocular depth estimation networks treat the pixel points of the different objects in a video scene indiscriminately and therefore cannot fully exploit the assistance that different pixels can provide to one another, which affects the quality of depth estimation. For example, when the target frame is reconstructed from the source frame, object boundaries have obvious features, matching between the two frames is easy, and their depth estimation is easy and accurate; for smooth surfaces such as roads, features are scarce, matching between the two frames is difficult, and their depth estimation is relatively difficult and error-prone. Existing depth estimation methods nevertheless estimate depth indiscriminately, so the regions whose depth is hard to estimate remain difficult and error-prone, which lowers the quality of the scene depth estimation.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a depth estimation method and system based on confidence-level grading and inter-level fusion enhancement, which classify pixels according to the difficulty of object depth estimation and use the easy and accurate region pixel depth values of the previous level to assist the difficult and error-prone region pixel depth estimation of the next level, thereby improving the accuracy and quality of depth estimation.
In order to achieve the above purpose, the present disclosure adopts the following technical solutions:
one or more embodiments provide a depth estimation method based on confidence level grading and inter-level fusion enhancement, comprising the following steps:
classifying pixels from easy to difficult according to the depth-estimation difficulty of the regions in the target frame scene to be detected;
according to the pixel classification, performing depth estimation on the region corresponding to each class, and extracting a depth map of the pixel region of each level and the confidence map of the pixel region corresponding to each level;
splicing the depth map and confidence map of each level and superposing them level by level to obtain an inter-level fused confidence map, and performing feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as inter-level feature enhancement information;
and encoding the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, then decoding to obtain the depth map corresponding to the target frame.
One or more embodiments provide a depth estimation system based on confidence ranking and inter-level fusion enhancement, comprising:
a grading module, configured to classify pixels from easy to difficult according to the depth-estimation difficulty of the regions in the target frame scene to be detected;
a grading extraction module, configured to perform depth estimation on the region corresponding to each class according to the pixel classification, and to extract a depth map of the pixel region of each level and the confidence map of the pixel region corresponding to each level;
an enhancement information superposition module, configured to splice the depth map and confidence map of each level and superpose them level by level to obtain the inter-level fused confidence map, and to perform feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as the inter-level feature enhancement information;
an enhancement and depth map extraction module, configured to encode the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, and then decode to obtain the depth map corresponding to the target frame.
An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the above method.
Compared with the prior art, the beneficial effects of the present disclosure are:
The present disclosure grades depth estimation at the pixel level: the easily recognized regions are recognized first, and their confidence maps are extracted as enhancement information to enhance the target frame image, so that the accurate region pixel depth values can assist the difficult and error-prone region pixel depth estimation of the next level. This reduces the difficulty of depth extraction in weak-texture, feature-poor regions where depth estimation is hard, and improves the accuracy of image depth estimation.
Advantages of the present disclosure, as well as advantages of additional aspects, will be described in detail in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure.
Fig. 1 is a block diagram of the self-supervised depth estimation network based on spatial information according to embodiment 1 of the present disclosure;
fig. 2 is a schematic structural diagram of the depth estimation network with pixel difficulty grading and inter-level auxiliary depth fusion according to embodiment 1 of the present disclosure;
FIG. 3 is a flow chart of an estimation method of embodiment 1 of the present disclosure;
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments in the present disclosure may be combined with each other. The embodiments will be described in detail below with reference to the accompanying drawings.
Example 1
In one or more embodiments, as shown in fig. 1 to fig. 3, a depth estimation method based on confidence level ranking and inter-level fusion enhancement includes the following steps:
Step 1, classify pixels from easy to difficult according to the depth-estimation difficulty of the regions in the target frame scene to be detected;
Step 2, according to the pixel classification, perform depth estimation on the region corresponding to each class, and extract a depth map of the pixel region of each level and the confidence map of the pixel region corresponding to each level;
Step 3, realize the inter-level feature-information enhancement based on confidence propagation and fusion: splice the depth map and confidence map of each level and superpose them level by level to obtain an inter-level fused confidence map, and perform feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as inter-level feature enhancement information;
Step 4, encode the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, and then decode to obtain the depth map corresponding to the target frame.
In this embodiment, depth estimation is graded at the pixel level: the easily recognized regions are recognized first, and their confidence maps are extracted as enhancement information to enhance the target frame image. The accurate region pixel depth values can then assist the difficult and error-prone region pixel depth estimation of the next level, which reduces the difficulty of depth extraction in feature-poor regions where depth estimation is hard and improves the accuracy of image depth estimation.
In this embodiment, the pixel classification performed from easy to difficult according to the depth estimation may specifically be: corners, edges, and other areas; alternatively, edges and other regions may be classified, and the classification method may be applied to objects without corners, such as spherical objects. In this embodiment, a first classification method is specifically described.
In some embodiments, the depth map is obtained through depth map extraction networks: a depth map extraction network is set for each level of pixel region, and the networks are trained level by level so that each is specialized for the features of its region. The depth map extraction network of every level can use the same self-supervised depth estimation network; specifically, in this embodiment the self-supervised depth estimation network based on spatial information is used as the reference model.
In some embodiments, the self-supervised depth estimation network based on spatial information comprises a depth estimation network and a pose transformation network:
depth estimation network: used to estimate the depth map of the target frame;
pose transformation network: used to predict the camera pose and provide the self-supervision signal;
the self-supervised monocular depth estimation network obtains its supervision signal by reconstructing the target frame from the source frame, which requires the depth estimation network to estimate the depth map of the target frame and the pose transformation network to predict the camera pose.
Optionally, the pose transformation network comprises a feature extraction network with a residual network as its core, followed by several cascaded convolutional layers; specifically, in this embodiment three convolutional layers are used.
In this embodiment, as shown in fig. 1, for the pose transformation network, the monocular-video target frame and source frame are first concatenated and input to the feature extraction network with the residual network as its core, and the camera transformation pose is then estimated by three convolutional layers.
In the residual network, the output of a shallow layer is added directly, via an identity mapping, to the output of the deeper layer after two or three convolutional layers, forming the basic residual unit; this increases the network depth while avoiding performance degradation.
Optionally, the residual network may be ResNet-18, which comprises a 7 × 7 convolutional layer, a 3 × 3 max-pooling layer, four residual convolution blocks and a global average pooling layer, where each residual convolution block contains two or more residual units whose intermediate layers are two 3 × 3 convolutional layers.
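As a concrete illustration of this pose branch, the following is a minimal PyTorch sketch, not the patented implementation: it assumes a dedicated ResNet-18 backbone whose first convolution is widened to accept the concatenated target and source frames, followed by three convolutional layers that regress a 6-DoF relative pose (axis-angle rotation plus translation). The class name and channel widths are illustrative assumptions, and the feature-extraction sharing with the depth encoder mentioned below is not modeled here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PoseNet(nn.Module):
    """Illustrative sketch of the pose transformation network: a ResNet-18
    feature extractor over the concatenated target/source frames, followed by
    three convolutional layers that regress a 6-DoF relative camera pose."""

    def __init__(self):
        super().__init__()
        backbone = resnet18()  # randomly initialised backbone
        # Two RGB frames are concatenated along the channel axis -> 6 input channels.
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep everything up to the last residual block (drop avgpool / fc).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # Three cascaded convolutions (channel widths are assumptions).
        self.pose_head = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, kernel_size=1),
        )

    def forward(self, target_frame: torch.Tensor, source_frame: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(torch.cat([target_frame, source_frame], dim=1))
        pose = self.pose_head(feat).mean(dim=[2, 3])   # global average -> (B, 6)
        return 0.01 * pose                             # axis-angle (3) + translation (3), small initial motions
```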
In some embodiments, the depth estimation network is based on an encoder-decoder architecture and comprises a depth encoder, a depth decoder, and upsampling modules connecting each stage of the depth encoder to the corresponding stage of the depth decoder.
First, the monocular-video target frame is input to the depth encoder with the residual network as its core to obtain multi-scale depth features; the pose transformation network and the depth estimation network share the feature extraction network to reduce the number of network parameters.
In step 4, the decoding steps are:
Step 41, according to the number of feature channels of the depth features obtained after encoding, splice the shallow features, level by level, onto the upsampled deep features;
Step 42, complete the multi-scale feature fusion of spatial depth information through convolution blocks: the depth features are decoded and spliced step by step through convolutional layers and a Sigmoid activation function to obtain multi-scale depth maps;
Step 43, upsample the multi-scale depth maps by bilinear interpolation to the same size as the input image, and reconstruct the target frame at the input image size to obtain the depth map corresponding to the input target frame image.
Specifically, the deep features output by the depth encoder pass through 3 × 3 convolution blocks and upsampling operations in the depth decoder, which on one hand reduce the number of feature channels and on the other hand allow the shallow features to be spliced, level by level according to the number of feature channels, onto the upsampled deep features; multi-scale feature fusion of spatial depth information is completed through 3 × 3 convolution blocks. Finally, during this multi-scale fusion, 3 × 3 convolutional layers and a Sigmoid activation function progressively decode the spliced depth features into multi-scale depth maps, which are then upsampled by bilinear interpolation to the input image size; the target frame is reconstructed at the input image size, providing the supervision signal for training the monocular depth estimation network.
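As an illustration of this skip-connected decoding, the following is a minimal PyTorch sketch under assumed ResNet-18 channel widths; it is not the patented decoder, only a demonstration of the upsample, splice and fuse pattern with a Sigmoid head at each scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """3x3 convolution + ELU, the basic decoder block used in this sketch."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.act = nn.ELU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))

class DepthDecoder(nn.Module):
    """Skip-connected decoder sketch: upsample deep features, splice the matching
    shallow encoder features, fuse with 3x3 convolutions, and emit multi-scale
    depth maps through Sigmoid heads."""
    def __init__(self, enc_ch=(64, 64, 128, 256, 512), dec_ch=(16, 32, 64, 128, 256)):
        super().__init__()
        self.up_convs, self.fuse_convs, self.heads = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        c_in = enc_ch[-1]
        for i in reversed(range(len(dec_ch))):
            self.up_convs.append(ConvBlock(c_in, dec_ch[i]))
            skip = enc_ch[i - 1] if i > 0 else 0            # shallowest stage has no skip feature
            self.fuse_convs.append(ConvBlock(dec_ch[i] + skip, dec_ch[i]))
            self.heads.append(nn.Sequential(nn.Conv2d(dec_ch[i], 1, 3, padding=1), nn.Sigmoid()))
            c_in = dec_ch[i]

    def forward(self, enc_feats):
        """enc_feats: list of encoder features, ordered shallow to deep."""
        x, depth_maps = enc_feats[-1], []
        for j, (up, fuse, head) in enumerate(zip(self.up_convs, self.fuse_convs, self.heads)):
            x = up(x)
            x = F.interpolate(x, scale_factor=2, mode="nearest")   # upsample deep features
            i = len(self.up_convs) - 1 - j
            if i > 0:
                x = torch.cat([x, enc_feats[i - 1]], dim=1)        # splice shallow feature
            x = fuse(x)
            depth_maps.append(head(x))                             # depth map at this scale
        return depth_maps                                          # coarse to fine
```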
The reconstruction of the target frame is a view synthesis process. As shown in formula (1), for each pixel position of the target frame, the corresponding position coordinates in the source frame are calculated from the depth map D_t, the relative pose T_{t→s} between the video source frame I_s and the video target frame I_t, and the camera intrinsics K; the pixel value at that source-frame position is then assigned to the corresponding target-frame position to obtain the reconstructed target frame I_{s→t}, and the supervision signal is constructed from the target-frame reconstruction error.
I_{s→t} = I_s ⟨ proj(D_t, T_{t→s}, K) ⟩    (1)
For the reconstruction errors obtained from different source frames and from depth maps of different scales, the minimum reconstruction error, rather than the average, is taken as the final photometric reconstruction loss:
L_p = min_s pe(I_t, I_{s→t}),  s ∈ {-1, 1}    (2)
This is because some problematic pixels appear only in the target frame and not in the source frame; even if the network predicts their depth accurately, they cannot be matched to corresponding source-frame pixel points owing to occlusion, which would otherwise incur a large reprojection-error penalty. Taking the per-pixel minimum over source frames suppresses this penalty.
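The following sketch illustrates the view synthesis of equation (1) and the per-pixel minimum reprojection loss of equation (2). It is a simplified rendition under assumed tensor shapes, not the exact patented code: pixels of the target frame are back-projected with D_t, transformed by T_{t→s}, projected with K, and sampled from the source frame; the photometric error combines SSIM and L1 and is minimized over source frames.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(source, depth, T, K):
    """Reconstruct the target frame from a source frame (equation (1)).
    source: (B,3,H,W), depth: (B,1,H,W), T: (B,4,4) target->source pose, K: (B,3,3)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=depth.dtype, device=depth.device),
                            torch.arange(w, dtype=depth.dtype, device=depth.device),
                            indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(b, -1, -1)   # homogeneous pixels
    cam = torch.linalg.inv(K) @ pix * depth.view(b, 1, -1)                      # back-project with D_t
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=cam.device, dtype=cam.dtype)], dim=1)
    src_cam = (T @ cam_h)[:, :3]                                                # apply T_{t->s}
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / (src_pix[:, 2:3] + 1e-7)                         # perspective divide
    # normalise pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([src_pix[:, 0] / (w - 1), src_pix[:, 1] / (h - 1)], dim=-1).view(b, h, w, 2) * 2 - 1
    return F.grid_sample(source, grid, mode="bilinear", padding_mode="border", align_corners=True)

def ssim(x, y):
    """Simplified SSIM over 3x3 neighbourhoods (average-pooling approximation)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * sxy + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2))

def photometric_error(target, recon, alpha=0.85):
    """pe(I_t, I_{s->t}): weighted SSIM + L1, averaged over channels, per pixel."""
    l1 = (target - recon).abs().mean(1, keepdim=True)
    s = (1 - ssim(target, recon).mean(1, keepdim=True)).clamp(0, 1) / 2
    return alpha * s + (1 - alpha) * l1

def min_reprojection_loss(target, recons):
    """Equation (2): per-pixel minimum of the reprojection error over the source frames."""
    errs = torch.stack([photometric_error(target, r) for r in recons], dim=0)
    return errs.min(dim=0).values
```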
In addition to the photometric reconstruction loss, an edge-aware smoothness loss is used to refine the predicted depth map, where ∂d denotes the depth gradient and ∂I the image gradient:
L_S = |∂_x d_t| e^{-|∂_x I_t|} + |∂_y d_t| e^{-|∂_y I_t|}    (3)
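A small sketch of this edge-aware smoothness term follows; the mean normalisation of the depth is an assumption commonly paired with this loss, not something stated in the text above.

```python
import torch

def edge_aware_smoothness(depth, image):
    """Equation (3): penalise depth gradients, down-weighted where the image
    itself has strong gradients (edges). depth: (B,1,H,W), image: (B,3,H,W)."""
    d = depth / (depth.mean(dim=[2, 3], keepdim=True) + 1e-7)   # optional mean normalisation (assumption)
    dx_d = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy_d = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```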
furthermore, the self-supervised monocular depth estimation network is typically trained under the assumption that the camera is moving, and the scene is stationary in the frame. The network predicted depth performance is greatly affected when assumptions are broken, such as the camera being fixed or there being moving objects in the scene.
In a video sequence, pixels which are kept the same in adjacent frames usually represent a still camera, a moving object or a low texture region, and by setting a simple binary automatic mask, u is 1 and the network loss function comprises the optical reconstruction loss function only when the reconstruction error of a reconstructed target frame and a reconstructed target frame is smaller than that of a target frame and a source frame. It can effectively filter pixels that remain unchanged from one frame to the next in the video. This has the effect of letting the network ignore objects that are moving at the same speed as the camera, even ignoring the entire frame in the monocular video when the camera stops moving. The final overall loss function L of the network training is:
L=uL P +L S (4)
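A sketch of the binary auto-mask and the total loss of equation (4), written so that it only needs the per-pixel error maps and the smoothness value as inputs (helper names are illustrative):

```python
import torch

def automask_total_loss(err_reconstructed, err_identity, smoothness, smooth_weight=1e-3):
    """Equation (4): L = u * L_p + L_S.
    err_reconstructed: per-pixel minimum reprojection error between the target frame and the
        target frames reconstructed from the source frames, shape (B,1,H,W).
    err_identity: per-pixel minimum photometric error between the target frame and the raw
        (un-warped) source frames, same shape.
    smoothness: scalar edge-aware smoothness loss L_S."""
    # u = 1 only where reconstruction beats the identity comparison, i.e. where the pixel
    # is better explained by camera motion than by staying unchanged between frames.
    u = (err_reconstructed < err_identity).float()
    l_p = (u * err_reconstructed).sum() / (u.sum() + 1e-7)
    return l_p + smooth_weight * smoothness
```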
the self-supervision depth estimation network based on the spatial information can be used as a reference model of depth estimation networks of all levels, the network structure is respectively set for each pixel region classification, and different loss functions are set for training based on pixels in different classification supervision specific regions.
The following describes a depth estimation scheme based on the difficulty level ranking.
In the step 1, pixel grading is carried out based on depth estimation difficulty grading of different objects in a target frame scene to be detected;
specifically, in this embodiment, for different objects in a target frame scene, pixels are classified into corners, edges, and other regions based on the object texture richness according to the difficulty of depth estimation, and scene object depth estimation is realized in stages. The pixels of a particular region are supervised by a weighted loss function during the training process.
Specifically, a first-level corner region depth estimation network is set for a corner region and used for extracting a depth value of the corner region in a target frame image; and setting a second-level edge area depth estimation network aiming at the edge area, wherein the second-level edge area depth estimation network is used for extracting the depth value of the edge area in the target frame image.
The self-supervision depth estimation network based on spatial domain information as shown in figure 1 is used as a reference model. Firstly, a depth value at a corner point in a video target frame scene is obtained relatively easily and accurately by using a reference model.
The reference model for estimating depth values at corners is used as the first-level corner-region depth estimation network. The loss function for training the first-level network model comprises the photometric reconstruction loss L_p, the edge-aware smoothness loss L_S, and a weighted loss L_C designed for corner-region depth estimation.
Specifically, L_C focuses on the reconstruction error at the corner points between the target frame and the target frame reconstructed from the source frame. The reconstruction error is measured by a weighted combination of L1 and SSIM, and for the corner reconstruction errors from different source frames the minimum is taken as the final corner loss L_C:
L_C = a (1 - SSIM(I'_{s→t}, I'_t)) + (1 - a) ||I'_{s→t} - I'_t||_1
where I'_{s→t} and I'_t denote the pixels at the corner positions in the reconstructed and original target frames.
Here the L1 loss compares the per-pixel differences and takes their absolute value, while the SSIM (structural similarity) loss accounts for the luminance, contrast and structure of the compared images.
Whether the network can accurately estimate the depth of the corner pixel points depends directly on the quality of corner detection in the target frame, so the method further includes corner detection. Optionally, the Harris corner detection method can be adopted: a local window is set and moved over the image to compute the change of the gray values; when the gray values of the region inside the window change significantly, a corner is considered to exist within the window.
The final overall loss function for training the first-level corner-region depth estimation network is L = u L_p + L_S + L_C. Through this weighted supervision, an overall depth map of ordinary quality together with sparse, high-quality corner-pixel depth values is obtained.
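The sketch below shows one way to build the corner mask with OpenCV's Harris detector and to evaluate the corner loss L_C at those positions. The threshold values and the weight a are illustrative assumptions, and the SSIM term uses the same average-pooled approximation as the earlier sketch; with several source frames, the per-pixel minimum of this error over the reconstructions would be taken before masking, as in equation (2).

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def harris_corner_mask(image_bgr, block_size=2, ksize=3, k=0.04, rel_thresh=0.01):
    """Binary corner mask from the Harris response (thresholds are illustrative)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    response = cv2.cornerHarris(gray, block_size, ksize, k)
    mask = (response > rel_thresh * response.max()).astype(np.float32)
    return torch.from_numpy(mask)[None, None]            # (1,1,H,W)

def _ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Average-pooled SSIM approximation over 3x3 neighbourhoods."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * sxy + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2))

def corner_loss(target, recon, corner_mask, a=0.85):
    """L_C: weighted SSIM + L1 reconstruction error evaluated only at corner pixels."""
    per_pixel = a * (1 - _ssim(target, recon)).mean(1, keepdim=True) / 2 \
                + (1 - a) * (target - recon).abs().mean(1, keepdim=True)
    return (per_pixel * corner_mask).sum() / (corner_mask.sum() + 1e-7)
```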
In some embodiments, the second level performs depth estimation for the more difficult and error-prone edge regions, and a second-level network is built to estimate the edge depth accurately. Similarly to the first-level network, a loss L_E supervising the edge depth is designed; the reconstruction error is again measured by a combination of L1 and SSIM:
L_E = a (1 - SSIM(I''_{s→t}, I''_t)) + (1 - a) ||I''_{s→t} - I''_t||_1
where I''_{s→t} and I''_t denote the pixels at the edge positions in the reconstructed and original target frames.
Whether the network can accurately estimate the depth of the edge pixel points depends directly on the quality of edge detection in the target frame. To identify the edge depth of the second-level output depth map accurately, edge detection is first performed on the target frame image; optionally, the Canny edge detection method can be adopted, with the following steps:
Gaussian filtering removes noise in the target frame image; the image gradient is computed with the Sobel operator to obtain candidate edges, where the gray-level changes are concentrated; non-maximum suppression keeps only the maximum gray-level change along the gradient direction within a local neighborhood; finally, strong edge pixels are selected by double thresholding.
During training, the overall loss function for the second-level network is L = u L_p + L_S + L_E.
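The Canny-based edge mask can be built analogously (the threshold values are illustrative assumptions); the edge loss L_E then reuses the same masked SSIM/L1 error with the edge mask in place of the corner mask, and the second-level total loss becomes L = u L_p + L_S + L_E.

```python
import cv2
import numpy as np
import torch

def canny_edge_mask(image_bgr, low=100, high=200):
    """Binary edge mask for L_E: Gaussian denoising followed by Canny, which internally
    applies Sobel gradients, non-maximum suppression and double thresholding."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, low, high)                 # uint8 map, 255 at strong edges
    mask = (edges > 0).astype(np.float32)
    return torch.from_numpy(mask)[None, None]             # (1,1,H,W)
```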
Finally, depth estimation is performed for the whole monocular video: a third-level whole-frame depth estimation network estimates the depth of all regions of the frame image. Because the first two levels have already estimated the depth at corners and edges, these no longer need special treatment, and the monocular depth estimation network described above can be used directly for learning. The overall loss function for the third-level network training is L = u L_p + L_S.
In step 3, an inter-level depth feature enhancement method based on confidence propagation is adopted.
Specifically, in step 3, if there are N pixel levels, the depth map and confidence map of each level are spliced and superposed level by level to obtain the inter-level fused confidence map; the procedure is as follows (a generic sketch is given after this list):
Step 31, splice the first-level depth map and confidence map to obtain a first intermediate confidence map;
Step 32, splice and fuse the first intermediate confidence map with the confidence map of the second-level pixel region to obtain the inter-level fused confidence map;
Step 33, splice the second-level depth map with the inter-level fused confidence map to obtain a second intermediate confidence map;
Step 34, splice and fuse the second intermediate confidence map with the confidence map of the third-level pixel region, and update the inter-level fused confidence map with the fusion result; the depth map and confidence map of the next pixel level are accumulated in turn according to steps 33-34, until feature extraction is performed on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as the inter-level feature enhancement information.
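The level-by-level accumulation of steps 31-34 can be summarised by the following schematic loop; the feature-extraction and fusion networks are placeholders standing in for the modules described below, and their interfaces are assumptions.

```python
import torch
import torch.nn as nn

def accumulate_interlevel_confidence(depth_maps, confidence_maps,
                                     interlevel_net: nn.Module, fuse_net: nn.Module):
    """Schematic of steps 31-34: for each level except the last, splice that level's depth map
    with the current confidence map, pass it through the confidence-propagation network to get
    an intermediate confidence map, then fuse it with the next level's confidence map.
    Returns the last level's depth map and the final fused confidence map."""
    fused_conf = confidence_maps[0]
    for level in range(len(depth_maps) - 1):
        intermediate = interlevel_net(torch.cat([depth_maps[level], fused_conf], dim=1))
        fused_conf = fuse_net(torch.cat([intermediate, confidence_maps[level + 1]], dim=1))
    return depth_maps[-1], fused_conf
```

The returned pair would then be spliced and passed through the inter-level feature extraction network once more to yield the inter-level feature enhancement information.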
In this embodiment, two levels are set, namely the corner pixel region and the edge pixel region; the specific level-by-level superposition process that yields the inter-level feature enhancement information comprises the following steps:
On the basis of classifying pixels by the difficulty of object depth estimation, the key is to use the easy and accurate region pixel depth values of the previous level to assist the difficult and error-prone region pixel depth estimation of the next level. This embodiment uses confidence-based propagation and fusion to realize the inter-level feature-information enhancement and thereby improve the quality of the overall depth estimation. The specific steps are:
Step 3.1: according to the pixel classification, in order from easy to difficult depth estimation, after the first-level depth estimation is finished, depth prediction is performed with the trained first-level network model to obtain a corner-based depth map D_1, in which the pixel depth values of the corner regions are accurate.
Step 3.2: the confidence at the corner positions in the depth map D_1 is set according to the corner coordinates provided by Harris corner detection, yielding a corner-depth confidence map C_1. Optionally, the confidence at a corner may be any non-zero value, with 0 elsewhere; in this embodiment the confidence at corners is set to 1.
Step 3.3: the depth map D_1 and the confidence map C_1 are spliced and used as the input of the confidence-propagation-based inter-level feature extraction network; after encoding and decoding, an intermediate confidence map C_1_p with the same resolution as the input is obtained.
As shown in fig. 2, the inter-level feature extraction network is similar to the encoder-decoder depth estimation network and comprises an encoder, a decoder and upsampling modules. On one hand, a residual-network-based encoder outputs multi-scale inter-level depth features, which are fused, scale by scale, with the encoder of the next-level depth estimation network; on the other hand, the deep features pass through the convolutional layers and upsampling operations of the decoder to reduce the number of feature channels, while the shallow features are spliced level by level, according to the number of feature channels, onto the upsampled deep features to complete the multi-scale feature fusion of spatial depth information. Finally, a 3 × 3 convolutional layer decodes the fused features into the intermediate confidence map C_1_p with the same resolution as the input.
Step 3.4: the trained second-level edge-region depth estimation network is used to obtain an edge-based depth map D_2. After the second-level edge-region depth estimation is completed, steps analogous to 3.1-3.3 of the first level are carried out: the confidence at the edge positions is set according to the edge coordinates provided by Canny edge detection, yielding an edge-depth confidence map C_2_s, in which the confidence is 1 at edges and 0 elsewhere.
The intermediate confidence map C_1_p obtained with the confidence-propagation-based inter-level feature extraction network carries the reliability information of the inter-level depth features derived from the accurate corner-region depth values of the depth map D_1, while the edge-depth confidence map C_2_s provides the reliability information of the edge depth values obtained after fusing the depth features.
Step 3.5: the intermediate confidence map C_1_p and the edge-depth confidence map C_2_s are spliced and their reliability information fused to obtain the inter-level fused confidence map C_2.
To make full use of the reliability information carried by the confidences, the intermediate confidence map C_1_p and the edge-depth confidence map C_2_s are spliced and fed into a simple convolution module consisting of two 3 × 3 convolutional layers with ReLU activations; the module fuses the confidence reliability information and outputs the inter-level fused confidence map C_2.
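A minimal sketch of this fusion module follows; the intermediate channel width is an assumption.

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    """Fuses the intermediate confidence map C_1_p with the edge-depth confidence map C_2_s
    through two 3x3 convolutions with ReLU, producing the inter-level fused confidence map C_2."""
    def __init__(self, channels=16):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, c_1p: torch.Tensor, c_2s: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([c_1p, c_2s], dim=1))   # (B,1,H,W) fused confidence map C_2
```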
Step 3.6: the edge depth map D_2 obtained at the second pixel level and the inter-level fused confidence map C_2 are spliced and fed into the inter-level feature extraction network to obtain multi-scale inter-level depth features, i.e. the inter-level feature enhancement information.
Specifically, for the depth estimation of the whole monocular video, the edge-based depth map D_2 and the inter-level fused confidence map C_2 are spliced and input to the confidence-propagation-based inter-level feature extraction network to obtain multi-scale inter-level depth features; these are fused with the encoder of the depth estimation network of this level to complete the final depth estimation of the whole scene, realizing multi-level, sparse-pixel, confidence-propagation-based inter-level fused depth estimation and improving the quality of the overall depth estimation.
The method further includes training each level of depth estimation network: the network corresponding to each pixel-level region is trained level by level, so that the resulting depth estimation network of each level estimates and identifies the depth map of the pixel region of its corresponding level. Specifically, the training process is as follows:
Step S1: train the first-level corner-region depth estimation network. Training emphasizes corner-region depth estimation; according to the loss function of the first-level network model, the correctness of the depth at the corner positions in the obtained depth map is ensured, yielding the trained first-level corner-region depth estimation network.
The trained first-level corner-region depth estimation network identifies the target frame image to obtain the depth map D_1 and the corner-depth confidence map C_1.
Step S2: train the second-level edge-region depth estimation network. The depth map D_1 and the confidence map C_1 are spliced and used as the input of the confidence-propagation-based inter-level feature extraction network; after encoding and decoding, an intermediate confidence map C_1_p with the same resolution as the input is obtained.
The features obtained by splicing the depth map D_1 and the confidence map C_1 are added at multiple scales to the depth encoder of the second-level edge-region depth estimation network to train it. During training, according to the training loss function of the second-level edge-region depth estimation network, the emphasis is on edge-region depth estimation, ensuring the accuracy of the depth at edge positions in the obtained depth map and yielding the trained second-level edge-region depth estimation network.
The trained second-level edge-region depth estimation network detects the target frame to obtain the depth map D_2 and the edge-depth confidence map C_2_s.
C_1_p and C_2_s are spliced and passed through convolution processing to obtain the edge confidence map C_2.
Step S3: train the third-level whole-frame depth estimation network. The depth map D_2 and the confidence map C_2 are spliced, and the resulting features are added at multiple scales to the depth encoder of the third-level whole-frame depth estimation network to train the self-supervised depth estimation network and obtain the final depth map.
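The staged training of steps S1-S3 can be summarised by the following high-level sketch. All network objects and training helpers are placeholders with assumed interfaces standing in for the modules described above; this is an outline of the data flow rather than the patented training code.

```python
import torch

def train_three_stage(frames, stage1_net, stage2_net, stage3_net,
                      interlevel_net, fusion_module, harris_conf, canny_conf,
                      train_stage):
    """Outline of the staged training flow.
    frames: dict with at least frames["target"]; harris_conf / canny_conf build confidence
    maps from corner / edge coordinates; train_stage(net, frames, extra_feats) stands for
    one stage's training loop with the loss described above."""
    # S1: first-level corner-region network, supervised with L = u*L_p + L_S + L_C.
    train_stage(stage1_net, frames, extra_feats=None)
    d1 = stage1_net(frames["target"])                 # corner-based depth map D_1
    c1 = harris_conf(frames["target"])                # corner-depth confidence map C_1

    # S2: second-level edge-region network, fed inter-level features from (D_1, C_1).
    c1_p, feats_1 = interlevel_net(torch.cat([d1, c1], dim=1))   # intermediate map + multi-scale features
    train_stage(stage2_net, frames, extra_feats=feats_1)          # loss L = u*L_p + L_S + L_E
    d2 = stage2_net(frames["target"])                 # edge-based depth map D_2
    c2_s = canny_conf(frames["target"])               # edge-depth confidence map C_2_s
    c2 = fusion_module(c1_p, c2_s)                    # inter-level fused confidence map C_2

    # S3: third-level whole-frame network, fed inter-level features from (D_2, C_2).
    _, feats_2 = interlevel_net(torch.cat([d2, c2], dim=1))
    train_stage(stage3_net, frames, extra_feats=feats_2)          # loss L = u*L_p + L_S
    return stage3_net(frames["target"])               # final depth map
```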
Example 2
Based on embodiment 1, this embodiment provides a depth estimation system based on confidence level ranking and inter-level fusion enhancement, including:
a grading module, configured to classify pixels from easy to difficult according to the depth-estimation difficulty of the regions in the target frame scene to be detected;
a grading extraction module, configured to perform depth estimation on the region corresponding to each class according to the pixel classification, and to extract a depth map of the pixel region of each level and the confidence map of the pixel region corresponding to each level;
an enhancement information superposition module, configured to splice the depth map and confidence map of each level and superpose them level by level to obtain the inter-level fused confidence map, and to perform feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as the inter-level feature enhancement information;
an enhancement and depth map extraction module, configured to encode the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, and then decode to obtain the depth map corresponding to the target frame.
Example 3
The present embodiment provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of embodiment 1.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. The depth estimation method based on confidence level grading and inter-level fusion enhancement is characterized by comprising the following steps of:
classifying pixels from easy to difficult according to the depth-estimation difficulty of the regions in the target frame scene to be detected;
according to the pixel classification, performing depth estimation on the region corresponding to each class, and extracting a depth map of the pixel region of each level and the confidence map of the pixel region corresponding to each level;
splicing the depth map and confidence map of each level and superposing them level by level to obtain an inter-level fused confidence map, and performing feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as inter-level feature enhancement information;
and encoding the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, then decoding to obtain the depth map corresponding to the target frame.
2. The method of depth estimation based on confidence level ranking and inter-level fusion enhancement of claim 1, wherein: and respectively setting depth map extraction networks corresponding to each level aiming at different hierarchical pixel regions, and training in a hierarchical manner respectively to obtain the depth map extraction networks aiming at the region characteristics, wherein the depth map extraction networks of each level adopt the same self-supervision depth estimation network.
3. The method of depth estimation based on confidence level ranking and inter-level fusion enhancement of claim 2, wherein: the self-supervision depth estimation network comprises a depth estimation network and an attitude transformation network:
depth estimation network: a depth map for estimating a target frame;
posture transformation network: for predicting the camera pose and providing the self-supervision signal;
or the posture transformation network comprises a characteristic extraction network taking a residual error network as a core and a plurality of cascaded convolution layers;
alternatively, the depth estimation network is specifically based on an encoder-decoder architecture, and includes a depth encoder and a depth decoder, and an upsampling module connecting each stage of the depth encoder and each stage of the depth decoder.
4. The method of depth estimation based on confidence level ranking and inter-level fusion enhancement as claimed in claim 1, wherein the method of decoding includes the following processes:
splicing the shallow feature to the deep feature after up-sampling step by step according to the number of feature channels of the depth feature obtained after coding;
and (3) completing multi-scale feature fusion of airspace depth information through a convolution block: gradually decoding and splicing the depth features through the convolution layer and the Sigmoid activation function to obtain a multi-scale depth map;
the multi-scale depth map is upsampled by bilinear interpolation to obtain a depth map with the same size as the input image, and the reconstruction of the target frame is completed at the size of the input target frame image to obtain the depth map corresponding to the input target frame image;
or, the inter-stage fusion confidence map is obtained by splicing and overlapping the depth map and the confidence map of each stage step by step, and the method comprises the following steps:
step 31, splicing the first pixel level depth map and the confidence map to obtain a first intermediate confidence map;
step 32, splicing and fusing the first intermediate confidence coefficient map and the confidence coefficient map of the second pixel level pixel area to obtain an inter-level fusion confidence coefficient map;
step 33, splicing the second pixel level depth map and the inter-level fusion confidence map to obtain a second intermediate confidence map;
step 34, splicing and fusing the second intermediate confidence coefficient map and the confidence coefficient map of the third pixel level pixel region, and updating the inter-level fusion confidence coefficient map according to the fusion result;
and (4) sequentially accumulating the depth map and the confidence map of the next pixel level according to the steps 33-34 until feature extraction is carried out on the depth map and the inter-level fusion confidence map of the pixel region of the last level to serve as inter-level feature enhancement information.
5. The method of depth estimation based on confidence level ranking and inter-level fusion enhancement of claim 1, wherein: performing pixel classification from easy to difficult according to depth estimation, specifically dividing the pixel classification into corners, edges and other areas, and setting a first-stage corner area depth estimation network for the corner areas to extract the depth values of the corner areas in the target frame image; setting a second-level edge area depth estimation network aiming at the edge area, wherein the second-level edge area depth estimation network is used for extracting the depth value of the edge area in the target frame image; and setting a third-level frame image overall recognition depth estimation network for performing depth estimation on all regions of the frame image.
6. The method of depth estimation based on confidence ranking and inter-level fusion enhancement of claim 5, wherein: the loss function for training the first-level corner-region depth estimation network is a weighted sum of the photometric reconstruction loss L_p, the edge-aware smoothness loss L_S, and the weighted loss L_C for corner-region depth estimation;
or, the loss function for training the second-level edge-region depth estimation network is a weighted sum of the photometric reconstruction loss L_p, the edge-aware smoothness loss L_S, and the loss L_E for edge-depth supervision.
7. The method of depth estimation based on confidence ranking and inter-level fusion enhancement of claim 5, wherein: the method further comprises training each level of depth estimation network level by level, with the following training process:
training the first-level corner-region depth estimation network, where training emphasizes corner-region depth estimation and, according to the loss function of the first-level network model training process, the accuracy of the depth at the corner positions in the obtained depth map is satisfied, obtaining the trained first-level corner-region depth estimation network;
identifying the target frame image through the trained first-level corner-region depth estimation network to obtain the depth map D_1 and the corner-depth confidence map C_1;
training the second-level edge-region depth estimation network:
the features obtained by splicing the depth map D_1 and the confidence map C_1 are added at multiple scales to the depth encoder of the second-level edge-region depth estimation network to train it; during training, according to the training loss function of the second-level edge-region depth estimation network, the accuracy of the depth at the edge positions in the obtained depth map is satisfied for edge-region depth estimation, obtaining the trained second-level edge-region depth estimation network;
detecting the target frame through the trained second-level edge-region depth estimation network to obtain the depth map D_2 and the edge-depth confidence map C_2_s;
splicing the depth map D_1 and the confidence map C_1, using the result as the input of the confidence-propagation-based inter-level feature extraction network, and obtaining, after encoding and decoding, an intermediate confidence map C_1_p with the same resolution as the input;
splicing the intermediate confidence map C_1_p and the edge-depth confidence map C_2_s, and obtaining the edge confidence map C_2 through convolution processing;
training the third-level whole-frame depth estimation network: splicing the depth map D_2 and the confidence map C_2, adding the obtained features at multiple scales to the depth encoder of the third-level whole-frame depth estimation network, and training the self-supervised depth estimation network to obtain the final depth map.
8. The method of depth estimation based on confidence ranking and inter-level fusion enhancement of claim 5, wherein: three levels are included, namely the corner pixel region, the edge pixel region, and the other regions of the image apart from corners and edges, and the specific process of obtaining the inter-level feature enhancement information by level-by-level superposition comprises the following steps:
according to the pixel classification, in order from easy to difficult depth estimation, after the first-level depth estimation is completed, performing depth prediction with the trained first-level network model to obtain a corner-based depth map D_1;
setting the confidence at the corner positions in the depth map D_1 according to the corner coordinates provided by Harris corner detection, yielding a corner-depth confidence map C_1;
splicing the depth map D_1 and the confidence map C_1, using the result as the input of the confidence-propagation-based inter-level feature extraction network, and obtaining, after encoding and decoding, an intermediate confidence map C_1_p with the same resolution as the input;
obtaining an edge-based depth map D_2 with the trained second-level edge-region depth estimation network, and setting the confidence at the edge positions according to the edge coordinates provided by Canny edge detection, yielding an edge-depth confidence map C_2_s;
splicing the intermediate confidence map C_1_p and the edge-depth confidence map C_2_s and fusing their reliability information to obtain the inter-level fused confidence map C_2.
9. A depth estimation system based on confidence level grading and inter-level fusion enhancement is characterized by comprising:
a grading module, configured to classify pixels from easy to difficult according to the depth-estimation difficulty of the regions in the target frame scene to be detected;
a grading extraction module, configured to perform depth estimation on the region corresponding to each class according to the pixel classification, and to extract a depth map of the pixel region of each level and the confidence map of the pixel region corresponding to each level;
an enhancement information superposition module, configured to splice the depth map and confidence map of each level and superpose them level by level to obtain the inter-level fused confidence map, and to perform feature extraction on the depth map of the last-level pixel region and the inter-level fused confidence map to serve as the inter-level feature enhancement information;
an enhancement and depth map extraction module, configured to encode the inter-level feature enhancement information together with the target frame image to be detected as the input of the trained depth estimation network, and then decode to obtain the depth map corresponding to the target frame.
10. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the steps of any of the methods of claims 1-8.
CN202210641764.3A 2022-06-08 2022-06-08 Depth estimation method and system based on confidence grading and inter-stage fusion enhancement Active CN115035172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210641764.3A CN115035172B (en) 2022-06-08 2022-06-08 Depth estimation method and system based on confidence grading and inter-stage fusion enhancement


Publications (2)

Publication Number Publication Date
CN115035172A true CN115035172A (en) 2022-09-09
CN115035172B CN115035172B (en) 2024-09-06

Family

ID=83122473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641764.3A Active CN115035172B (en) 2022-06-08 2022-06-08 Depth estimation method and system based on confidence grading and inter-stage fusion enhancement

Country Status (1)

Country Link
CN (1) CN115035172B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
US20220067950A1 (en) * 2020-08-31 2022-03-03 Samsung Electronics Co., Ltd. Method and apparatus to complement depth image
US20220130062A1 (en) * 2020-10-24 2022-04-28 Tata Consultancy Services Limited Method and system for unsupervised prediction of image depth and confidence map
CN114155406A (en) * 2021-11-25 2022-03-08 上海师范大学 Pose estimation method based on region-level feature fusion

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116258756A (en) * 2023-02-23 2023-06-13 齐鲁工业大学(山东省科学院) Self-supervision monocular depth estimation method and system
CN116258756B (en) * 2023-02-23 2024-03-08 齐鲁工业大学(山东省科学院) Self-supervision monocular depth estimation method and system
CN116912290A (en) * 2023-09-11 2023-10-20 四川都睿感控科技有限公司 Memory-enhanced method for detecting small moving targets of difficult and easy videos
CN116912290B (en) * 2023-09-11 2023-12-15 四川都睿感控科技有限公司 Memory-enhanced method for detecting small moving targets of difficult and easy videos

Also Published As

Publication number Publication date
CN115035172B (en) 2024-09-06

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
GR01 Patent grant