CN114882091A - Depth estimation method combined with semantic edge - Google Patents
Depth estimation method combined with semantic edges
- Publication number
- CN114882091A (application number CN202210476348.2A)
- Authority
- CN
- China
- Prior art keywords
- edge
- semantic
- depth
- module
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/50 — Image analysis: depth or shape recovery
- G06T7/13 — Image analysis: segmentation; edge detection
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
- G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level: of extracted features
- G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning: using neural networks
- G06T2207/10028 — Range image; depth image; 3D point clouds
- Y02T10/40 — Engine management systems
Abstract
The invention relates to a depth estimation method combined with semantic edges, which comprises the following steps: acquiring an image on which depth estimation is to be performed; and inputting the image into a trained deep learning network to obtain a depth prediction map and a semantic edge prediction map. The deep learning network comprises a shared feature extraction module, a depth estimation module, an edge enhancement weight module, a depth edge semantic classification module and a semantic edge detection module. The shared feature extraction module extracts feature information from the image and passes it to the depth estimation module and the semantic edge detection module. The depth estimation module guides disparity smoothing with the semantic edges output by the semantic edge detection module and performs depth estimation by means of image dual reconstruction. The edge enhancement weight module forms, based on the depth edges of the depth prediction map output by the depth estimation module, the feature result to be fused by the semantic edge detection module. The depth edge semantic classification module performs depth edge semantic classification prediction, and the semantic edge detection module outputs the semantic edge classification prediction of the image. The invention can improve estimation accuracy.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a depth estimation method combined with semantic edges.
Background
Depth estimation and semantic edge extraction are fundamental problems in computer vision, and their results can be deployed in practical applications such as autonomous driving, virtual reality and robotics to help achieve better overall performance. Depth estimation refers to recovering three-dimensional perception information from an image. Semantic edge extraction combines edge extraction with classification, so that object boundaries and their semantic labels are obtained simultaneously. Both tasks are currently handled mainly by deep learning methods.
Depth estimation is further divided into monocular and multi-view depth estimation. Because it relies on a single sensor, monocular depth estimation offers advantages such as fast processing and low cost. Multi-view depth estimation acquires information from two or more sensors and therefore suffers from drawbacks such as information redundancy and cumbersome labeling. Monocular depth estimation is consequently the mainstream choice in current research and on the market.
Because monocular depth estimation uses only a single image, its accuracy is limited. The current mainstream approach is to guide the training of the depth estimation network with ground-truth values of additional information, such as edge information or semantic information, so as to produce more accurate depth maps. The problem with these methods is that no ground truth for the additional information is available at inference time, so the additional information can only be generated by the trained model itself, losing the reliability that ground-truth guidance provides during training. Another unreasonable aspect of current mainstream monocular depth estimation schemes is that a single left view is used to generate both the left and the right disparity maps, yet generating the right disparity from the left view is considered unreasonable.
The main difficulty of semantic edge extraction is suppressing the influence of non-semantic edges. To obtain finer semantic edges, the mainstream approaches either add a non-maximum suppression loss (NMS_Loss) during training or use dynamic weights to enhance edge responses and suppress non-edge responses. The problem with the dynamic-weight methods is that the weights are learned from the features by brute force, lack theoretical support, and introduce additional learnable layers, making the network more complex.
At present, associating semantics with depth and associating edges with depth are the main directions explored for joint learning with the depth estimation task, whereas how to associate depth estimation with semantic edges has received little attention. Two simple strategies are possible. The first is that, given semantic edge labels, depth estimation can be guided by those labels. Conversely, given depth labels, semantic edge detection can be guided by the depth labels. However, both approaches depend heavily on the accuracy of the given labels and operate in a step-wise manner, which can be suboptimal and inefficient.
Disclosure of Invention
The invention aims to provide a depth estimation method combined with semantic edges, which can improve the accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a depth estimation method combined with semantic edges is provided, which comprises the following steps:
acquiring an image to be subjected to depth estimation;
inputting the image into a trained deep learning network to obtain a depth prediction image and a semantic edge prediction image;
wherein the deep learning network comprises: a shared feature extraction module, a depth estimation module, an edge enhancement weight module, a depth edge semantic classification module and a semantic edge detection module; the shared feature extraction module is used for extracting feature information from the image and transmitting it to the depth estimation module and the semantic edge detection module; the depth estimation module guides disparity smoothing with the semantic edges output by the semantic edge detection module and performs depth estimation by means of image dual reconstruction; the edge enhancement weight module forms, based on the depth edges of the depth prediction map output by the depth estimation module, the feature result to be fused by the semantic edge detection module; the depth edge semantic classification module is used for performing depth edge semantic classification prediction; and the semantic edge detection module is used for outputting the semantic edge classification prediction of the image.
The depth estimation module guides disparity smoothing through the semantic edges output by the semantic edge detection module: the disparity gradients are penalized with a smoothness term weighted by the semantic edge map, where ∂x and ∂y denote the gradients in the X and Y directions respectively, d_{i,j} denotes the disparity value at pixel (i, j) of the RGB image, N denotes the number of pixels, S_{i,j} denotes the value at (i, j) of the semantic edge map, and ε is a hyper-parameter.
Image dual reconstruction means that the pixel value at a point of the left view is shifted by its left-disparity value and assigned to the shifted position to obtain the right view, and that the left-disparity value at a point is used to look up the corresponding pixel value in the right view and assign it back to that point to reconstruct the left view.
The edge enhancement weight module is configured as follows: the depth edge of the depth prediction map is extracted by an edge detection operator and input to an EEW unit, and the EEW unit outputs dynamic weight information satisfying F̂ = W ⊙ F, where F = {A^(1), A^(2), A^(3)}, A^(1), A^(2), A^(3) respectively denote the feature information at different depths extracted by the shared feature extraction module, W denotes the dynamic weight information, and F̂ denotes the enhanced features to be fused.
The depth edge semantic classification module combines the depth edge extracted by the Laplacian operator with the feature information extracted by the shared feature extraction module, performs depth edge semantic classification prediction through CASENet, and is supervised with the multi-label loss of CASENet.
The depth edge semantic classification module generates the ground truth of the depth edge semantic classification from the predicted depth edge and the ground truth of the semantic edge: the intersection of the depth edge and the semantic edge ground truth is taken as the ground truth of the depth edge semantic classification task.
Advantageous effects
Due to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: by sharing the feature extraction module, the depth estimation and semantic edge extraction tasks benefit from each other. The invention uses semantic-edge-guided disparity smoothing and image dual reconstruction to improve depth estimation at edges. In semantic edge detection, an edge enhancement weight strategy is proposed, which enhances edge pixels by learning weights from depth edges and assigning them to the edge features, thereby improving the accuracy of the semantic edges. The invention further proposes a depth edge semantic classification model to enforce consistency between semantic edges and depth edges, realizing implicit loss supervision.
Drawings
FIG. 1 is a schematic structural diagram of a deep learning network in an embodiment of the present invention;
FIG. 2 is a schematic diagram of image double reconstruction in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an edge enhancement weighting module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the depth edge semantic classification module in an embodiment of the invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a depth estimation method combined with semantic edges, which comprises the following steps: acquiring an image to be subjected to depth estimation; inputting the image into a trained deep learning network to obtain a depth prediction image and a semantic edge prediction image;
as shown in fig. 1, the deep learning network includes: the system comprises a shared feature extraction module, a depth estimation module, an edge enhancement weight module, a depth edge semantic classification module and a semantic edge detection module; the shared feature extraction module is used for extracting feature information in the image and transmitting the feature information to the depth estimation module and the semantic edge detection module; the depth estimation module guides parallax smoothing through the semantic edge output by the semantic edge detection module and carries out depth estimation in an image double reconstruction mode; the edge enhancement weighting module forms a feature result required to be fused by the semantic edge detection module based on the depth edge of the depth prediction image output by the depth estimation module; the depth edge semantic classification module is used for performing depth edge semantic classification prediction; the semantic edge detection module is used for outputting semantic edge classification prediction of the image.
This embodiment exploits the strong consistency between depth edges and semantic edges to realize synchronized joint learning of the two tasks, edge enhancement weighting, depth edge semantic classification, semantic-edge-guided disparity smoothing, and an image dual-reconstruction loss.
In common disparity estimation models, the original RGB image is usually used to guide disparity smoothing, i.e., local smoothness of the disparity is optimized with an L1 penalty on the disparity gradients. Since depth discontinuities tend to coincide with gradients of the RGB image, this penalty is weighted with an edge-aware term. In this embodiment, the depth estimation module instead guides disparity smoothing with the semantic edges output by the semantic edge detection module: the disparity gradients are penalized with a smoothness term weighted by the semantic edge map, where ∂x and ∂y denote the gradients in the X and Y directions respectively, d_{i,j} denotes the disparity value at pixel (i, j) of the RGB image, N denotes the number of pixels, S_{i,j} denotes the value at (i, j) of the semantic edge map, and ε is a hyper-parameter, set to 0.001 in this embodiment.
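As a concrete illustration, here is a minimal PyTorch-style sketch of such a semantic-edge-guided smoothness loss. The exact weighting form (an exponential down-weighting of the disparity gradients by the semantic edge response, with ε added inside the exponent) is an assumption modeled on the standard edge-aware smoothness loss; the patent only fixes the symbols d, S, N and ε.

```python
import torch

def semantic_edge_smoothness_loss(disparity, sem_edge, eps=0.001):
    """Edge-aware disparity smoothness guided by a semantic edge map (assumed form).

    disparity: (B, 1, H, W) predicted disparity d
    sem_edge:  (B, 1, H, W) semantic edge probability map S in [0, 1]
    eps:       hyper-parameter epsilon (0.001 in the described embodiment)
    """
    # gradients of the disparity in the x and y directions
    d_dx = torch.abs(disparity[:, :, :, 1:] - disparity[:, :, :, :-1])
    d_dy = torch.abs(disparity[:, :, 1:, :] - disparity[:, :, :-1, :])
    # edge-aware weights from the semantic edge map (strong edge -> small weight)
    w_x = torch.exp(-(sem_edge[:, :, :, 1:] + eps))
    w_y = torch.exp(-(sem_edge[:, :, 1:, :] + eps))
    # mean over all N pixels
    return (d_dx * w_x).mean() + (d_dy * w_y).mean()
```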
Since there is no justification for generating the right disparity from the left view, this embodiment does not train by predicting a right disparity and reconstructing the right view from it. Instead, a dual-reconstruction scheme is proposed: the left disparity and the left view are used to generate the right view. As shown in FIG. 2, the pixel value at a point of the left view is shifted by its disparity value and assigned to the shifted position to obtain the right view; given an ideal left view, right view and left disparity, the right view of non-occluded objects can be considered reconstructible from the left view and the left disparity alone. Conversely, the disparity value at a point of the left disparity map is used to look up the corresponding pixel value in the right view and assign it back to that point, reconstructing a left view whose pixel values are completely dense.
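The following is a minimal PyTorch-style sketch of the two reconstructions. The sign convention of the disparity shift, the rounding in the forward scatter, and the handling of collisions and occlusions (left unfilled) are simplifying assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def reconstruct_left(right, disp_left):
    """Backward warp: each left pixel samples the right view at x - d,
    yielding a completely dense reconstructed left view."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=right.device),
                            torch.arange(w, device=right.device), indexing="ij")
    xs = xs.float().expand(b, h, w)
    ys = ys.float().expand(b, h, w)
    x_src = xs - disp_left.squeeze(1)                    # assumed sign convention
    grid = torch.stack([2.0 * x_src / (w - 1) - 1.0,     # normalize to [-1, 1]
                        2.0 * ys / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

def reconstruct_right(left, disp_left):
    """Forward scatter: each left pixel is moved by its own disparity to build
    the right view; occluded regions stay empty (zeros)."""
    b, c, h, w = left.shape
    right = torch.zeros_like(left)
    xs = torch.arange(w, device=left.device).view(1, 1, w).expand(b, h, w)
    x_dst = (xs - disp_left.squeeze(1)).round().long().clamp(0, w - 1)
    index = x_dst.unsqueeze(1).expand(b, c, h, w)
    right.scatter_(3, index, left)                       # collisions overwrite arbitrarily
    return right
```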
The edge enhancement weight module in this embodiment is based on the strong consistency between depth edges and semantic edges: the Sobel operator is used to extract the depth edge of the depth map, which serves as the input of the EEW unit in FIG. 3 so as to learn dynamic weight information W ∈ R^{h×w×8}. The features of side1–side3 are organized as F = {A^(1), A^(2), A^(3)}, and a weight is assigned to each pixel so that edge information is strengthened, yielding F̂ = W ⊙ F, which forms the final feature result to be fused by the semantic edge extraction branch.
Denoting by A^(5)_i the i-th class output of side5, A_f refers to the fused feature combination input to the classification layer of CASENet; a K-grouped 1×1 classification convolution is then applied to produce a K-channel activation map, which is also the final output of the semantic edge extraction branch.
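A minimal PyTorch-style sketch of the edge enhancement weighting follows. The small convolutional head used to produce the dynamic weights W is an assumption (the patent only states that an EEW unit learns W ∈ R^{h×w×8} from the Sobel depth edge); the channel count is therefore left as a parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sobel kernels for extracting a depth edge map from the predicted depth
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

class EdgeEnhancementWeight(nn.Module):
    """Learns per-pixel dynamic weights W from the depth edge and applies F_hat = W * F."""

    def __init__(self, feat_channels):
        super().__init__()
        # assumed small conv head mapping the depth edge to one weight per feature channel
        self.weight_head = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, feat_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, depth, feats):
        # depth: (B, 1, H, W); feats: (B, feat_channels, H, W) from side1-side3
        gx = F.conv2d(depth, SOBEL_X.to(depth.device), padding=1)
        gy = F.conv2d(depth, SOBEL_Y.to(depth.device), padding=1)
        depth_edge = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)
        w = self.weight_head(depth_edge)   # dynamic weights W
        return w * feats                   # F_hat = W ⊙ F (per-pixel edge enhancement)
```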
The depth edge semantic classification module in this embodiment combines the depth edge D_edge extracted by the Laplacian operator with the side4 features, performs depth edge semantic classification prediction through the shared concatenation and fused classification structures of CASENet, and is supervised with the multi-label loss of CASENet, as shown in FIG. 4. Specifically, a ground truth for the depth edge classification is first generated: the predicted depth edge and the ground truth of the semantic edge are used to produce the ground truth of the depth edge semantic classification. Since the ground truths of the depth edge and the semantic edge are both 0–1 binary maps, their intersection can be regarded as the ground truth of the depth edge semantic classification task, i.e. Gt_sem-edge ∩ Bin(D_edge), where Gt_sem-edge denotes the ground truth of semantic edge extraction and Bin(D_edge) denotes the binary map obtained by thresholding the depth edge D_edge with a hyper-parameter threshold, set to 0.5 in this embodiment.
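As a concrete illustration, a minimal PyTorch-style sketch of generating this ground truth is given below; the tensor layout and the per-class broadcasting are assumptions.

```python
import torch

def depth_edge_cls_ground_truth(gt_sem_edge, pred_depth_edge, threshold=0.5):
    """Ground truth for the depth edge semantic classification task.

    gt_sem_edge:     (B, K, H, W) binary per-class semantic edge ground truth
    pred_depth_edge: (B, 1, H, W) predicted depth edge response
    threshold:       hyper-parameter (0.5 in the described embodiment)
    """
    depth_edge_bin = (pred_depth_edge > threshold).float()  # Bin(D_edge)
    return gt_sem_edge * depth_edge_bin                     # elementwise intersection
```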
This embodiment currently achieves the best results in monocular depth estimation and improves over the baseline network by 5.1% on the semantic edge extraction task.
As can readily be seen, the invention realizes mutual benefit between the depth estimation and semantic edge extraction tasks by sharing the feature extraction module. The invention uses semantic-edge-guided disparity smoothing and image dual reconstruction to improve depth estimation at edges. In semantic edge detection, an edge enhancement weight strategy is proposed, which enhances edge pixels by learning weights from depth edges and assigning them to the edge features, thereby improving the accuracy of the semantic edges. The invention further proposes a depth edge semantic classification model to enforce consistency between semantic edges and depth edges, realizing implicit loss supervision.
Claims (6)
1. A depth estimation method combined with semantic edges, characterized by comprising the following steps:
acquiring an image to be subjected to depth estimation;
inputting the image into a trained deep learning network to obtain a depth prediction image and a semantic edge prediction image;
wherein the deep learning network comprises: a shared feature extraction module, a depth estimation module, an edge enhancement weight module, a depth edge semantic classification module and a semantic edge detection module; the shared feature extraction module is used for extracting feature information from the image and transmitting it to the depth estimation module and the semantic edge detection module; the depth estimation module guides disparity smoothing with the semantic edges output by the semantic edge detection module and performs depth estimation by means of image dual reconstruction; the edge enhancement weight module forms, based on the depth edges of the depth prediction map output by the depth estimation module, the feature result to be fused by the semantic edge detection module; the depth edge semantic classification module is used for performing depth edge semantic classification prediction; and the semantic edge detection module is used for outputting the semantic edge classification prediction of the image.
2. The depth estimation method combined with semantic edges according to claim 1, wherein the depth estimation module guides disparity smoothing through the semantic edges output by the semantic edge detection module: the disparity gradients are penalized with a smoothness term weighted by the semantic edge map, where ∂x and ∂y denote the gradients in the X and Y directions respectively, d_{i,j} denotes the disparity value at pixel (i, j) of the RGB image, N denotes the number of pixels, S_{i,j} denotes the value at (i, j) of the semantic edge map, and ε is a hyper-parameter.
3. The depth estimation method combined with semantic edges according to claim 1, wherein the image dual reconstruction comprises shifting the pixel value at a point of the left view by its disparity value and assigning it to the shifted position to obtain the right view, and using the disparity value at a point of the left disparity map to look up the corresponding pixel value in the right view and assign it back to that point to reconstruct the left view.
4. The depth estimation method combined with semantic edges according to claim 1, wherein the edge enhancement weight module is configured to: extract the depth edge of the depth prediction map through an edge detection operator and input it to an EEW unit, the EEW unit outputting dynamic weight information satisfying F̂ = W ⊙ F, where F = {A^(1), A^(2), A^(3)}, A^(1), A^(2), A^(3) respectively denote the feature information at different depths extracted by the shared feature extraction module, W denotes the dynamic weight information, and F̂ denotes the enhanced features to be fused.
5. The depth estimation method combined with semantic edges according to claim 1, wherein the depth edge semantic classification module combines the depth edge extracted by the Laplacian operator with the feature information extracted by the shared feature extraction module, performs depth edge semantic classification prediction through CASENet, and is supervised with the multi-label loss of CASENet.
6. The depth estimation method combined with semantic edges according to claim 5, wherein the depth edge semantic classification module generates the ground truth of the depth edge semantic classification from the predicted depth edge and the ground truth of the semantic edge, and the intersection of the depth edge and the semantic edge ground truth is taken as the ground truth of the depth edge semantic classification task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210476348.2A CN114882091B (en) | 2022-04-29 | 2022-04-29 | Depth estimation method combining semantic edges |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210476348.2A CN114882091B (en) | 2022-04-29 | 2022-04-29 | Depth estimation method combining semantic edges |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114882091A true CN114882091A (en) | 2022-08-09 |
CN114882091B CN114882091B (en) | 2024-02-13 |
Family
ID=82674259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210476348.2A Active CN114882091B (en) | 2022-04-29 | 2022-04-29 | Depth estimation method combining semantic edges |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114882091B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106204522A (en) * | 2015-05-28 | 2016-12-07 | 奥多比公司 | The combined depth of single image is estimated and semantic tagger |
CN109829929A (en) * | 2018-12-30 | 2019-05-31 | 中国第一汽车股份有限公司 | A kind of level Scene Semantics parted pattern based on depth edge detection |
CN110120049A (en) * | 2019-04-15 | 2019-08-13 | 天津大学 | By single image Combined estimator scene depth and semantic method |
CN110781897A (en) * | 2019-10-22 | 2020-02-11 | 北京工业大学 | Semantic edge detection method based on deep learning |
US20200160533A1 (en) * | 2018-11-15 | 2020-05-21 | Samsung Electronics Co., Ltd. | Foreground-background-aware atrous multiscale network for disparity estimation |
CN111401380A (en) * | 2020-03-24 | 2020-07-10 | 北京工业大学 | RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization |
CN112150493A (en) * | 2020-09-22 | 2020-12-29 | 重庆邮电大学 | Semantic guidance-based screen area detection method in natural scene |
CN112950645A (en) * | 2021-03-24 | 2021-06-11 | 中国人民解放军国防科技大学 | Image semantic segmentation method based on multitask deep learning |
CN113096176A (en) * | 2021-03-26 | 2021-07-09 | 西安交通大学 | Semantic segmentation assisted binocular vision unsupervised depth estimation method |
CN113822919A (en) * | 2021-11-24 | 2021-12-21 | 中国海洋大学 | Underwater image relative depth estimation method based on semantic information constraint |
CN114241210A (en) * | 2021-11-22 | 2022-03-25 | 中国海洋大学 | Multi-task learning method and system based on dynamic convolution |
CN114359361A (en) * | 2021-12-28 | 2022-04-15 | Oppo广东移动通信有限公司 | Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium |
Non-Patent Citations (3)
Title |
---|
JING LIU et al.: "Collaborative Deconvolutional Neural Networks for Joint Depth Estimation and Semantic Segmentation", IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pages 5655-5666, XP011692883, DOI: 10.1109/TNNLS.2017.2787781 *
ZHANG Peng et al.: "Real-Time Binocular Depth Estimation Algorithm Driven by Semantic Edges", Computer Science, vol. 48, no. 9, pages 216-222 *
ZHANG Haodong, SONG Jiafei, ZHANG Guanghui: "Stereo Matching Algorithm Based on Edge-Guided Feature Fusion and Cost Aggregation", Computer Engineering and Applications, pages 182-188 *
Also Published As
Publication number | Publication date |
---|---|
CN114882091B (en) | 2024-02-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |