CN117830369A - Dynamic scene depth estimation method integrating multi-cue depth information - Google Patents
- Publication number
- CN117830369A (Application No. CN202311744712.XA)
- Authority
- CN
- China
- Prior art keywords
- depth
- layer
- dynamic
- image
- depth estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a dynamic scene depth estimation method integrating multi-cue depth information. The method comprises the following steps: step 1: acquiring a sequence of images containing a dynamic target; step 2: performing depth estimation on the image sequence using a multi-frame depth estimation algorithm to obtain a first depth map, which serves as the target image; step 3: performing depth estimation on the image sequence using a depth estimation algorithm that fuses monocular and multi-view cues to obtain a second depth map, which serves as the source image; step 4: fusing the first depth map and the second depth map to obtain the final depth map. The invention enhances the robustness and generalization of dynamic scene depth estimation.
Description
Technical Field
The invention relates to the technical field of dynamic scene depth estimation, and in particular to a dynamic scene depth estimation method integrating multi-cue depth information.
Background
Depth information plays a key role in numerous fields such as scene understanding, three-dimensional reconstruction, virtual reality, and autonomous driving, and has attracted intense research interest in computer vision and surveying and mapping. Specialized hardware such as LiDAR can directly obtain fairly accurate depth information, but it is expensive, its measurement range is limited, and the Doppler effect can bias measurements when capturing moving objects. Moreover, completely static scenes hardly exist in the real world, so accurately acquiring depth information for dynamic scenes is of strong research significance and has broad development prospects.
With the rapid development of convolutional neural networks, deep-learning-based image depth estimation methods can recover fairly accurate depth information from single-frame or multi-frame images, and offer advantages such as high cost effectiveness and strong environmental perception. However, recovering depth from a single image is an ill-posed problem: each pixel in a two-dimensional image carries only the projection of an object under one viewpoint, so the object's position in space cannot be uniquely determined; objects at different distances can produce the same projection, making the depth information non-unique. To address this, single-frame depth estimation methods are often supplemented with other constraints to ensure the uniqueness of the depth values: multi-scale information fusion can add constraints on the depth; motion information from other sensors, such as an inertial measurement unit (IMU), can serve as auxiliary information; and scene priors such as object size, texture, and geometry can constrain the depth. Single-frame depth estimation relies only on monocular cues and achieves relatively high accuracy in dynamic regions (such as moving cars and pedestrians).
Compared with single-frame depth recovery, multi-frame methods obtain depth information of higher overall accuracy and are widely used in three-dimensional scene reconstruction. Multi-frame depth estimation is based on geometric constraints, using a cost volume to match features between adjacent frames under different depth hypotheses. However, dynamic points are displaced between adjacent frames, violating the static-scene assumption; this injects erroneous values into the cost volume and misleads the network's predictions. Multi-frame depth estimation methods therefore face unavoidable challenges in dynamic regions.
Dynamic scene depth estimation refers to obtaining depth information for both the dynamic foreground and the static background. For dense three-dimensional reconstruction of dynamic scenes, the handling of the dynamic foreground falls into two categories. The first identifies all moving objects from mask or optical-flow information, estimates the motion of dynamic pixels, and processes foreground and background separately. The second fuses the advantages of monocular and multi-view cues, using an attention mechanism to learn multi-view geometric features and monocular features at the feature level. Both approaches process images at the feature level and cannot fully integrate the advantages of single-frame and multi-view methods; they also generalize poorly and thus cannot be applied directly to other scenes.
Disclosure of Invention
In order to solve the problems that global accuracy degrades when only monocular cues are used for depth estimation in a dynamic scene, and that dynamic-region estimates are erroneous when only multi-view cues are used, the invention provides a dynamic scene depth estimation method that fuses multi-cue depth information, enhancing the robustness and generalization of dynamic scene depth estimation so that the method can be widely applied to other scenes.
The invention provides a dynamic scene depth estimation method fusing multi-cue depth information, which comprises the following steps:
step 1: acquiring a sequence image containing a dynamic target;
step 2: performing depth estimation on the sequence image containing the dynamic target by utilizing a multi-frame depth estimation algorithm, and acquiring a first depth image serving as a target image;
step 3: performing depth estimation on the sequence image containing the dynamic target by using a depth estimation algorithm fusing monocular and multi-view cues, and acquiring a second depth image serving as a source image;
step 4: and fusing the first depth map and the second depth map to obtain a final depth map.
Further, in step 2, an ACVNet algorithm is selected as the multi-frame depth estimation algorithm.
Further, in step 2, a DMdepth algorithm is selected as a depth estimation algorithm that merges monocular and multiview cues.
Further, in step 4, specifically includes:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
according to the difference between the source image and the target image, calculating and transmitting gradient information of the region to be fused according to the following formula to obtain a final depth map:

F = D_multi ⊕ D_fusion, i.e. min_F ∬_Ω ‖∇F − ∇D_fusion‖² dp, subject to F|_∂Ω = D_multi|_∂Ω, with F = D_multi outside Ω;

wherein D_multi represents the first depth map, D_fusion represents the second depth map, Ω represents the region to be fused in the fused target image I, ∇ represents the gradient operator, F represents the final depth map, and ⊕ represents the fusion operation.
Further, in step 4, specifically includes:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
inputting the first depth map and the region to be fused into a gradient fusion network to obtain a final depth map; the gradient fusion network comprises a gradient layer, an encoder, a decoder and an output layer connected in sequence; the gradient layer is used for extracting the gradient of the region to be fused; the encoder comprises 6 sequentially connected layers, wherein the first layer comprises a convolution layer and a GroupNorm layer connected in sequence, and the second to fifth layers each comprise a LeakyReLU activation layer, a convolution layer and a GroupNorm layer connected in sequence; the decoder comprises 6 layers, the first to sixth layers each comprising a GroupNorm layer, an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output layer comprises an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output of each decoder layer is skip-connected as input to the output layer.
The invention has the beneficial effects that:
Rather than fusing at the feature level, the invention starts from the depth maps generated by each method: the depth map generated by the method using only multi-frame cues is taken as the target image, preserving its superior static-background depth information; the depth map from the method fusing monocular and multi-view cues is taken as the source image, preserving its optimal dynamic-foreground depth information. During fusion, the dynamic-region mask is selected as the fusion region, decoupling the dynamic foreground depth and accurately identifying the foreground information to be fused. Gradient information between the foreground and background images is computed and processed with the Poisson fusion method, so that the gradients of the foreground image are propagated into the background image along the edges of the mask region; the foreground information is thus fully blended into the good background information, and the depth estimation of the whole image is optimal. The method enhances the robustness and generalization of dynamic scene depth estimation, so that it can be widely applied to other scenes.
Drawings
Fig. 1 is a flow chart of a dynamic scene depth estimation method with multi-cue depth information fusion provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a qualitative analysis of monocular cues, multi-view cues, and fusion of monocular and multi-view cues according to an existing experiment;
FIG. 3 is a schematic diagram, provided by an embodiment of the present invention, of (a) the case where a dynamic pixel violates the standard constraints of multi-view geometry, and (b) the specific constraint violations in a dynamic scene;
fig. 4 is a schematic structural diagram of a gradient fusion network according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Existing experimental results (shown in Fig. 2) show that the monocular method and the multi-view/multi-frame method each have advantages in depth estimation; their results are shown in Fig. 2(a) and (b), respectively, and the result of the method fusing monocular and multi-view cues is shown in Fig. 2(c).
A qualitative analysis of the images in Fig. 2 shows that the errors of the depth map generated by the multi-view method are mainly concentrated on the car in the dynamic region; the depth map generated by the monocular method has more errors in the background; and the depth map generated by the method fusing monocular and multi-view cues contains artifacts that do not exist in the scene. The inventors believe these artifacts arise because only in a static environment do 3D points in space satisfy the projection relationships of multi-view geometry, while the dynamic foreground does not satisfy the static-environment assumption. The specific form in which a dynamic point violates the epipolar geometry constraint is shown in Fig. 3(a), where x_1 and x_2 are matched point positions in two consecutive frames and F is the fundamental matrix. Constraints can be derived from the epipolar, triangulation, fundamental-matrix estimation, or reprojection-error equations; the specific violations of geometric constraints in a dynamic scene are shown in Fig. 3(b): (1) the projection of point X_2 in frame I_2 lies too far from the epipolar line l_2; (2) the back-projected rays connecting the camera optical centers to the projected points fail to intersect at a single point; (3) the presence of dynamic features leads to erroneous estimation of the fundamental matrix; (4) for point X_2 in frame I_1, the distance between the reprojected feature and the observed feature is too large.
Table 1 shows the quantitative results of the different depth estimation methods on the KITTI dataset. As can be seen from Table 1, the method using only multi-frame cues has the lowest overall error of the three but performs worst in dynamic regions; the method using only monocular cues far exceeds the multi-view method in dynamic regions, although it is slightly weaker there than the fusion method and its overall error is larger; the method fusing monocular and multi-view cues performs best in dynamic regions, but it disturbs the static multi-view cues, so its overall error is slightly higher than that of the method using multi-view cues alone.
TABLE 1
From the above analysis it can be seen that partial feature-level fusion (of monocular and multi-view cues) better exploits the respective advantages of the single-frame and multi-frame methods, making the estimation of the dynamic foreground region optimal. However, its overall performance does not surpass depth estimation using multi-view cues alone. The inventors therefore believe that the monocular cues in feature-level fusion interfere with the multi-frame cues, so that feature-level fusion cannot fully exploit the advantage of geometric constraints. Moreover, feature-level fusion increases the number of parameters during model training, so the model performs best only on the training dataset and its generalization is weakened.
In order to solve the above technical problems, the embodiment of the present invention departs from the idea of feature-level fusion and provides a dynamic scene depth estimation method fusing multi-cue depth information that starts from the depth maps generated by each method. As shown in Fig. 1, it comprises the following steps:
s101: acquiring a sequence image containing a dynamic target;
s102: performing depth estimation on the sequence image containing the dynamic target by utilizing a multi-frame depth estimation algorithm, and acquiring a first depth image serving as a target image;
specifically, in consideration of the performance of each existing multi-frame depth estimation algorithm, the present embodiment selects the ACVNet algorithm as the multi-frame depth estimation algorithm. The ACVNet algorithm may be referred to as "Attention Concatenation Volume for Accurate and Efficient Stereo Matching".
In the embodiment of the invention, the process of performing depth estimation by using an ACVNet algorithm mainly comprises the following steps: feature extraction, cost body construction, cost aggregation and parallax regression;
feature extraction: features are extracted using a network structure that is ResNet-like. Firstly, downsampling an input image by using convolution operation on the first three layers of a network; then generating 1/4 resolution characteristic information by residual connection; finally, all the features under 1/4 resolution are connected to generate attention weight; and compressing the feature map through two-layer convolution operation to construct an initial connector.
Cost volume construction: geometric information is extracted from the correlation of the stereo pair to generate attention weights. To address mismatching in textureless regions, a more robust cost volume is constructed using multi-level adaptive patch matching. Three feature maps of different levels are obtained from the feature extraction module, with 64, 128, and 128 channels respectively. For pixels at each level, matching costs are computed with dilated convolutions of predefined size and adaptively learned weights; controlling the dilation rate ensures that the same number of pixels is used when computing the similarity of the pixel at the patch center. The similarity of two corresponding pixels is then a weighted sum of the correlations between corresponding pixels within the patch. The resulting cost volume is processed by an hourglass module, which consists of four 3D convolutions with batch normalization and ReLU layers and two directly stacked hourglass networks. Each hourglass network is an encoder-decoder structure: the encoder contains four 3D convolution layers, and the decoder consists of two 3D deconvolution layers.
Cost aggregation: the encoder in the cost aggregation module produces an output, Output0, and each hourglass decoder produces an output, denoted Output1 and Output2 respectively. For each output, two 3D convolution modules generate a single-channel 4D cost volume, which is then converted into a probability volume along the disparity dimension by a softmax function and upsampled.
Disparity regression: the final prediction is computed with a soft-argmin function to generate the first depth map D_multi.
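As a hedged illustration of this regression step (a minimal sketch, not ACVNet's exact implementation), the soft-argmin can be written as a softmax over the negated costs followed by an expectation over the disparity hypotheses:

```python
import numpy as np

def soft_argmin(cost_volume, disparities):
    """Soft-argmin disparity regression: convert a cost volume into a
    probability volume via a softmax over negated costs, then take the
    expected disparity. cost_volume: (D, H, W); lower cost = better match."""
    logits = -cost_volume
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)       # probability volume
    # expected disparity: sum_d d * P(d) per pixel
    return np.tensordot(disparities, prob, axes=(0, 0))

# toy example: 4 disparity hypotheses over a 2x2 image
disps = np.array([0.0, 1.0, 2.0, 3.0])
cost = np.full((4, 2, 2), 10.0)
cost[2, 0, 0] = 0.0   # strong match at disparity 2 for pixel (0, 0)
pred = soft_argmin(cost, disps)
```

A pixel with one clearly lowest cost regresses to that hypothesis, while a pixel with a flat cost profile regresses to the mean of the hypotheses.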
S103: performing depth estimation on the sequence image containing the dynamic target by using a depth estimation algorithm fusing monocular and multi-view cues, and acquiring a second depth image serving as a source image;
specifically, a DMdepth algorithm is selected as a depth estimation algorithm that fuses monocular and multiview cues. For details of the DMdepth algorithm reference is made to "Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes".
In the embodiment of the invention, given a sequence of images {I_{t-1}, I_t, I_{t+1}} with known camera intrinsics and pose information, where I_{t-1} and I_{t+1} are the frames adjacent to I_t, the goal is to estimate the depth of I_t and obtain a depth map D_t. The process of depth estimation using the DMdepth algorithm mainly includes:
the monocular cues are first constructed as a monocular depth volume. The simple U-Net architecture is used for predicting the monocular depth, and a pseudo monocular depth value is generated; then each absolute depth value is changed into a single thermal vector to obtain a single thermal depth body C mono 。
The multi-view cue is then built into a cost volume. Using the camera poses, camera intrinsics, and depth hypotheses uniformly sampled within the disparity range, the adjacent frames {I_{t-1}, I_{t+1}} are reconstructed to the target view. SSIM is then used to compute the pixel-level similarity between each reconstructed image and the target image, from which the multi-view cost volume C_multi is constructed.
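For reference, the SSIM similarity used here can be sketched as below. This is a hedged, global (whole-image) SSIM with the common default constants; cost-volume construction would apply a windowed, per-pixel variant of the same formula, and the cost conversion shown is a typical choice rather than DMdepth's documented one:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between two images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
img = rng.random((8, 8))                                   # target image patch
noisy = np.clip(img + 0.2 * rng.standard_normal((8, 8)), 0.0, 1.0)  # warped/reconstructed patch
# a common photometric matching cost derived from SSIM:
cost = (1.0 - ssim(img, noisy)) / 2.0
```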
The advantages of the two cues are then fused using a cross-cue attention fusion mechanism. First, C_multi and C_mono are downsampled through convolution layers to generate the multi-view and monocular depth features F_multi and F_mono. These features are fed into a cross-cue attention module, which extracts the relative internal relations between the two depth features; the monocular and multi-view relation features are then concatenated to generate the fused feature F_fused. Meanwhile, to retain the details of the original feature maps, a residual connection concatenates the original multi-view and monocular features into the depth concatenation feature F_cat. The final feature can be expressed as F = F_fused + F_cat. The fused feature F and the target frame image are input into the final depth estimation network to obtain the second depth map D_fusion.
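One step of cross-cue attention can be sketched as follows. This is a hedged single-head illustration under simplifying assumptions: the actual module would use learned query/key/value projections, and F_cat is a concatenation followed by a projection, which is stood in for here by a plain identity residual:

```python
import numpy as np

def cross_cue_attention(q_feat, kv_feat):
    """One head of cross-cue attention: features of one cue (queries) attend
    over features of the other cue (keys/values): softmax(Q K^T / sqrt(d)) V.
    Returns the attended output and the attention weights."""
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv_feat, w

rng = np.random.default_rng(0)
f_multi = rng.standard_normal((6, 8))   # 6 positions, 8-dim multi-view features
f_mono = rng.standard_normal((6, 8))    # monocular features at the same positions
f_fused, attn = cross_cue_attention(f_multi, f_mono)  # multi-view attends to monocular
f_final = f_fused + f_multi             # residual path standing in for F_cat
```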
S104: and fusing the first depth map and the second depth map to obtain a final depth map.
Specifically, the purpose of the fusion operation is to transfer the dynamic-foreground information of D_fusion into D_multi, preserving the superior static-scene depth estimate while fusing in high-precision dynamic foreground depth information.
In the embodiment of the invention, the multi-cue depth maps are fused based on the idea of Poisson fusion. Convolutional neural networks attend to different details when processing different types of data. Monocular depth estimation focuses on strongly textured information in the scene, so it produces a better dynamic foreground while neglecting complex static background (such as foliage). Multi-frame depth estimation performs all computation under geometric constraints, so the matching between images is stronger and the static-scene estimate is more faithful. Fusing multi-cue depth information is therefore a direct way to enhance the depth estimation. The goal of our approach is to find the optimal fusion operation between D_multi and D_fusion:
F = D_multi ⊕ D_fusion
Inspired by Poisson fusion, the fusion operation transfers the dynamic-foreground gradients of D_fusion into D_multi, preserving the superior static-scene depth estimate while fusing in high-precision dynamic foreground depth information.
In addition, to better determine the region to be fused, the dynamic target mask of the target image is first used to locate the dynamic target region in the source image, which serves as the region to be fused; then, according to the difference between the source image and the target image, and based on the idea of Poisson fusion, the gradient information of the region to be fused is computed and propagated according to the following formula to obtain the final depth map:

min_F ∬_Ω ‖∇F − ∇D_fusion‖² dp, subject to F|_∂Ω = D_multi|_∂Ω, with F = D_multi outside Ω;

where D_multi denotes the first depth map, D_fusion the second depth map, Ω the region to be fused in the fused target image I, ∇ the gradient operator, F the final depth map, and ⊕ the fusion operation. The meaning of the formula is that only gradient-domain information is considered inside Ω, while in the other regions only the value-domain (intensity) information of the image is considered.
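A minimal numeric sketch of this gradient-domain fusion is given below, assuming a simple Jacobi iteration on the discrete Poisson equation (all names and the toy depth maps are illustrative; a real implementation would use a faster solver or, as in Example 2, a learned network):

```python
import numpy as np

def gradient_fuse(target, source, mask, n_iter=4000):
    """Gradient-domain fusion: inside the mask, solve the discrete Poisson
    equation Laplacian(F) = Laplacian(source) by Jacobi iteration, holding
    F equal to the target outside the mask, so the source's gradients are
    propagated into the target across the mask boundary."""
    F = target.astype(float).copy()
    lap_s = (np.roll(source, 1, 0) + np.roll(source, -1, 0) +
             np.roll(source, 1, 1) + np.roll(source, -1, 1) - 4.0 * source)
    inner = mask.copy()
    inner[0, :] = inner[-1, :] = inner[:, 0] = inner[:, -1] = False
    for _ in range(n_iter):
        nb = (np.roll(F, 1, 0) + np.roll(F, -1, 0) +
              np.roll(F, 1, 1) + np.roll(F, -1, 1))
        F[inner] = (nb[inner] - lap_s[inner]) / 4.0   # Jacobi update
    return F

# toy depth maps: flat "target" background, curved "source" foreground
d_multi = np.zeros((12, 12))
d_fusion = np.add.outer(np.arange(12.0) ** 2, np.zeros(12))
omega = np.zeros((12, 12), dtype=bool)
omega[3:9, 3:9] = True                   # dynamic-region mask
fused = gradient_fuse(d_multi, d_fusion, omega)
```

Outside Ω the result equals the target exactly; inside Ω the result reproduces the source's Laplacian (here a constant 2) while seamlessly meeting the target at the mask boundary.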
The embodiment of the invention departs from the idea of feature-level fusion and starts from the depth maps generated by each method: the depth map generated by the method using only multi-frame cues is taken as the target image, preserving its better background information; the depth map from the method fusing monocular and multi-view cues is taken as the source image, preserving the optimal dynamic-foreground depth information. In general, the dynamic regions of outdoor scenes (such as cars and pedestrians) have prominent edge information: the pixel intensity of an image changes sharply at object edges, where the image gradient reaches its maximum. Inspired by Poisson fusion, the fusion region is determined from the mask information, so that the gradients of the foreground image are propagated into the background image along the mask region; multi-view geometric constraints and monocular cues are thus fully utilized, and the overall dynamic scene depth estimation is optimal.
Example 2
Building on the above embodiment, the difference is that, considering direct Poisson fusion is nontrivial, the embodiment of the present invention uses a multi-layer gradient fusion network to approximate Poisson fusion. The structure of the gradient fusion network is shown in Fig. 4: it comprises a gradient layer (grad), an encoder, a decoder and an output layer connected in sequence. The gradient layer extracts the gradient of the region to be fused. The encoder comprises 6 sequentially connected layers: the first layer comprises a convolution layer and a GroupNorm layer in sequence, and the second to fifth layers each comprise a LeakyReLU activation layer, a convolution layer and a GroupNorm layer in sequence. The decoder comprises 6 layers, the first to sixth each comprising a GroupNorm layer, an upsampling layer, a convolution layer and a LeakyReLU activation layer in sequence. The output layer comprises an upsampling layer, a convolution layer and a LeakyReLU activation layer in sequence; the output of each decoder layer is skip-connected as input to the output layer. The purpose of the gradient layer is to let the two types of cues share hyperparameters without degrading training performance.
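The fixed (non-learned) gradient layer at the front of this network can be sketched as a simple finite-difference operator; this is a hedged assumption about its behavior, since the patent does not specify the exact discretization:

```python
import numpy as np

def gradient_layer(region):
    """Sketch of a non-learned gradient layer: forward differences along x
    and y, stacked as a 2-channel map to feed the encoder. The last
    row/column is padded by replication, so its difference is zero."""
    gx = np.diff(region, axis=1, append=region[:, -1:])
    gy = np.diff(region, axis=0, append=region[-1:, :])
    return np.stack([gx, gy])

ramp = np.add.outer(np.arange(4.0), np.zeros(5))  # depth increases by 1 per row
g = gradient_layer(ramp)
```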
Specifically, a dynamic target mask in a target image is utilized to position a dynamic target area in a source image and serve as an area to be fused; and inputting the first depth map and the region to be fused into a gradient fusion network to obtain a final depth map.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (5)
1. The dynamic scene depth estimation method integrating multi-cue depth information is characterized by comprising the following steps of:
step 1: acquiring a sequence image containing a dynamic target;
step 2: performing depth estimation on the sequence image containing the dynamic target by utilizing a multi-frame depth estimation algorithm, and acquiring a first depth image serving as a target image;
step 3: performing depth estimation on the sequence image containing the dynamic target by using a depth estimation algorithm fusing monocular and multi-view cues, and acquiring a second depth image serving as a source image;
step 4: and fusing the first depth map and the second depth map to obtain a final depth map.
2. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 2, an ACVNet algorithm is selected as the multi-frame depth estimation algorithm.
3. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 2, DMdepth algorithm is selected as the depth estimation algorithm for fusion of monocular and multi-view cues.
4. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 4, specifically comprising:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
according to the difference between the source image and the target image, calculating and transmitting gradient information of the region to be fused according to the following formula to obtain a final depth map:

min_F ∬_Ω ‖∇F − ∇D_fusion‖² dp, subject to F|_∂Ω = D_multi|_∂Ω, with F = D_multi outside Ω;

wherein D_multi represents the first depth map, D_fusion represents the second depth map, Ω represents the region to be fused in the fused target image I, ∇ represents the gradient operator, F represents the final depth map, and ⊕ represents the fusion operation.
5. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 4, specifically comprising:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
inputting the first depth map and the region to be fused into a gradient fusion network to obtain a final depth map; the gradient fusion network comprises a gradient layer, an encoder, a decoder and an output layer connected in sequence; the gradient layer is used for extracting the gradient of the region to be fused; the encoder comprises 6 sequentially connected layers, wherein the first layer comprises a convolution layer and a GroupNorm layer connected in sequence, and the second to fifth layers each comprise a LeakyReLU activation layer, a convolution layer and a GroupNorm layer connected in sequence; the decoder comprises 6 layers, the first to sixth layers each comprising a GroupNorm layer, an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output layer comprises an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output of each decoder layer is skip-connected as input to the output layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311744712.XA CN117830369A (en) | 2023-12-18 | 2023-12-18 | Dynamic scene depth estimation method integrating multi-cue depth information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117830369A true CN117830369A (en) | 2024-04-05 |
Family
ID=90507048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311744712.XA Pending CN117830369A (en) | 2023-12-18 | 2023-12-18 | Dynamic scene depth estimation method integrating multi-cue depth information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117830369A (en) |
- 2023-12-18: application CN202311744712.XA filed in China (CN); published as CN117830369A; status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
US11954813B2 (en) | Three-dimensional scene constructing method, apparatus and system, and storage medium | |
Liu et al. | Continuous depth estimation for multi-view stereo | |
WO2018127007A1 (en) | Depth image acquisition method and system | |
CN108062769B (en) | Rapid depth recovery method for three-dimensional reconstruction | |
CN108734776A (en) | A kind of three-dimensional facial reconstruction method and equipment based on speckle | |
CN113077505B (en) | Monocular depth estimation network optimization method based on contrast learning | |
CN115035235A (en) | Three-dimensional reconstruction method and device | |
CN112288788A (en) | Monocular image depth estimation method | |
CN109644280B (en) | Method for generating hierarchical depth data of scene | |
CN113436254B (en) | Cascade decoupling pose estimation method | |
CN112270701B (en) | Parallax prediction method, system and storage medium based on packet distance network | |
CN111652922B (en) | Binocular vision-based monocular video depth estimation method | |
CN113269823A (en) | Depth data acquisition method and device, storage medium and electronic equipment | |
Zhou et al. | Single-view view synthesis with self-rectified pseudo-stereo | |
CN112927348A (en) | High-resolution human body three-dimensional reconstruction method based on multi-viewpoint RGBD camera | |
CN115965961B (en) | Local-global multi-mode fusion method, system, equipment and storage medium | |
CN112489097A (en) | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution | |
Kao | Stereoscopic image generation with depth image based rendering | |
CN117830369A (en) | Dynamic scene depth estimation method integrating multi-cue depth information | |
CN112907645B (en) | Disparity map acquisition method, disparity map acquisition device, disparity map training method, electronic device, and medium | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
CN115908992A (en) | Binocular stereo matching method, device, equipment and storage medium | |
KR100655465B1 (en) | Method for real-time intermediate scene interpolation | |
Pan et al. | An automatic 2D to 3D video conversion approach based on RGB-D images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||