CN117830369A - Dynamic scene depth estimation method integrating multi-cue depth information - Google Patents
- Publication number
- CN117830369A (Application No. CN202311744712.XA)
- Authority
- CN
- China
- Prior art keywords
- depth
- layer
- dynamic
- image
- depth estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a dynamic scene depth estimation method integrating multi-cue depth information. The method comprises the following steps: step 1: acquiring a sequence of images containing a dynamic target; step 2: performing depth estimation on the image sequence using a multi-frame depth estimation algorithm to obtain a first depth map, which serves as the target image; step 3: performing depth estimation on the image sequence using a depth estimation algorithm that fuses monocular and multi-view cues to obtain a second depth map, which serves as the source image; step 4: fusing the first depth map and the second depth map to obtain the final depth map. The invention enhances the robustness and generalization of dynamic scene depth estimation.
Description
Technical Field
The invention relates to the technical field of dynamic scene depth estimation, and in particular to a dynamic scene depth estimation method integrating multi-cue depth information.
Background
Depth information plays a key role in numerous fields such as scene understanding, three-dimensional reconstruction, virtual reality, and autonomous driving, and has attracted intense research interest in computer vision and surveying and mapping. Specialized hardware such as LiDAR can directly obtain fairly accurate depth information, but it is expensive, its measurement range is limited, and the Doppler effect can bias measurements when capturing moving objects. Moreover, completely static scenes hardly exist in the real world, so accurately acquiring depth information for dynamic scenes is of strong research significance and has broad development prospects.
With the rapid development of convolutional neural networks, deep-learning-based image depth estimation methods can recover fairly accurate depth information from single-frame or multi-frame images, and offer advantages such as high cost effectiveness and strong environmental perception. However, recovering depth from a single image is an ill-posed problem: each pixel in a two-dimensional image carries only the projection of an object under one viewpoint, so the object's position in space cannot be uniquely determined; objects at different distances can produce the same projection, making the depth information non-unique. To address this, single-frame depth estimation methods are often supplemented with other constraints to ensure the uniqueness of the depth values: multi-scale information fusion can add constraints on the depth; motion information from other sensors, such as an inertial measurement unit (IMU), can serve as auxiliary information; and scene priors such as object size, texture, and geometry can constrain the depth. Single-frame depth estimation relies only on monocular cues and achieves relatively high accuracy in dynamic regions (such as moving cars and pedestrians).
Compared with single-frame depth recovery, multi-frame methods obtain depth information of higher overall accuracy and are widely used in three-dimensional scene reconstruction. Multi-frame depth estimation is based on geometric constraints, using a cost volume to match features between adjacent frames under different depth hypotheses. However, dynamic points are displaced between adjacent frames, violating the static-scene assumption; this injects erroneous values into the cost volume and misleads the network's predictions. Multi-frame depth estimation methods therefore face unavoidable challenges in dynamic regions.
Dynamic scene depth estimation refers to obtaining depth information for both the dynamic foreground and the static background. For dense three-dimensional reconstruction of dynamic scenes, the handling of the dynamic foreground falls into two categories. The first identifies all moving objects from mask or optical-flow information, estimates the motion of dynamic pixels, and processes foreground and background separately. The second fuses the advantages of monocular and multi-view cues, using an attention mechanism to learn multi-view geometric features and monocular features at the feature level. Both approaches process images at the feature level and cannot fully integrate the advantages of single-frame and multi-view methods; they also generalize poorly and thus cannot be applied directly to other scenes.
Disclosure of Invention
In order to solve the problems that global accuracy degrades when only monocular cues are used for depth estimation in a dynamic scene, and that dynamic-region estimates are erroneous when only multi-view cues are used, the invention provides a dynamic scene depth estimation method that fuses multi-cue depth information, enhancing the robustness and generalization of dynamic scene depth estimation so that the method can be widely applied to other scenes.
The invention provides a dynamic scene depth estimation method fusing multi-cue depth information, which comprises the following steps:
step 1: acquiring a sequence image containing a dynamic target;
step 2: performing depth estimation on the sequence image containing the dynamic target by utilizing a multi-frame depth estimation algorithm, and acquiring a first depth image serving as a target image;
step 3: performing depth estimation on the sequence image containing the dynamic target by using a depth estimation algorithm fusing monocular and multi-view cues, and acquiring a second depth image serving as a source image;
step 4: and fusing the first depth map and the second depth map to obtain a final depth map.
Further, in step 2, an ACVNet algorithm is selected as the multi-frame depth estimation algorithm.
Further, in step 2, a DMdepth algorithm is selected as a depth estimation algorithm that merges monocular and multiview cues.
Further, in step 4, specifically includes:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
according to the difference between the source image and the target image, calculating and transmitting gradient information of the region to be fused according to the following formula to obtain a final depth map:

F = D_multi ⊕ D_fusion, i.e. min_F ∬_Ω ‖∇F − ∇D_fusion‖² dp, subject to F|_∂Ω = D_multi|_∂Ω, with F = D_multi outside Ω;

wherein D_multi represents the first depth map, D_fusion represents the second depth map, Ω represents the region to be fused in the fused target image I, ∇ represents the gradient operator, F represents the final depth map, and ⊕ represents the fusion operation.
Further, in step 4, specifically includes:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
inputting the first depth map and the region to be fused into a gradient fusion network to obtain a final depth map; the gradient fusion network comprises a gradient layer, an encoder, a decoder and an output layer connected in sequence; the gradient layer is used for extracting the gradient of the region to be fused; the encoder comprises 6 sequentially connected layers, wherein the first layer comprises a convolution layer and a GroupNorm layer connected in sequence, and the second to fifth layers each comprise a LeakyReLU activation layer, a convolution layer and a GroupNorm layer connected in sequence; the decoder comprises 6 layers, the first to sixth layers each comprising a GroupNorm layer, an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output layer comprises an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output of each decoder layer is skip-connected as input to the output layer.
The invention has the beneficial effects that:
Rather than fusing at the feature level, the invention starts from the depth maps generated by each method: the depth map generated by the method using only multi-frame cues is taken as the target image, preserving its superior static-background depth information; the depth map from the method fusing monocular and multi-view cues is taken as the source image, preserving its optimal dynamic-foreground depth information. During fusion, the dynamic-region mask is selected as the fusion region, decoupling the dynamic foreground depth and accurately identifying the foreground information to be fused. Gradient information between the foreground and background images is computed and processed with the Poisson fusion method, so that the gradients of the foreground image are propagated into the background image along the edges of the mask region; the foreground information is thus fully blended into the good background information, and the depth estimation of the whole image is optimal. The method enhances the robustness and generalization of dynamic scene depth estimation, so that it can be widely applied to other scenes.
Drawings
Fig. 1 is a flow chart of a dynamic scene depth estimation method with multi-cue depth information fusion provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a qualitative analysis of monocular cues, multi-view cues, and fusion of monocular and multi-view cues according to an existing experiment;
FIG. 3 is a schematic diagram, provided by an embodiment of the present invention, of (a) the case where a dynamic pixel violates the standard constraints of multi-view geometry, and (b) the specific constraint violations in a dynamic scene;
fig. 4 is a schematic structural diagram of a gradient fusion network according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Existing experimental results (shown in Fig. 2) show that the monocular method and the multi-view/multi-frame method each have advantages in depth estimation; their results are shown in Fig. 2(a) and (b), respectively, and the result of the method fusing monocular and multi-view cues is shown in Fig. 2(c).
A qualitative analysis of the images in Fig. 2 shows that the errors of the depth map generated by the multi-view method are mainly concentrated on the car in the dynamic region; the depth map generated by the monocular method has more errors in the background; and the depth map generated by the method fusing monocular and multi-view cues contains artifacts that do not exist in the scene. The inventors believe these artifacts arise because only in a static environment do 3D points in space satisfy the projection relationships of multi-view geometry, while the dynamic foreground does not satisfy the static-environment assumption. The specific form in which a dynamic point violates the epipolar geometry constraint is shown in Fig. 3(a), where x_1 and x_2 are matched point positions in two consecutive frames and F is the fundamental matrix. Constraints can be derived from the epipolar, triangulation, fundamental-matrix estimation, or reprojection-error equations; the specific violations of geometric constraints in a dynamic scene are shown in Fig. 3(b): (1) the projection of point X_2 in frame I_2 lies too far from the epipolar line l_2; (2) the back-projected rays connecting the camera optical centers to the projected points fail to intersect at a single point; (3) the presence of dynamic features leads to erroneous estimation of the fundamental matrix; (4) for point X_2 in frame I_1, the distance between the reprojected feature and the observed feature is too large.
Table 1 shows the quantitative results of the different depth estimation methods on the KITTI dataset. As can be seen from Table 1, the method using only multi-frame cues has the lowest overall error of the three but performs worst in dynamic regions; the method using only monocular cues far exceeds the multi-view method in dynamic regions, although it is slightly weaker there than the fusion method and its overall error is larger; the method fusing monocular and multi-view cues performs best in dynamic regions, but it disturbs the static multi-view cues, so its overall error is slightly higher than that of the method using multi-view cues alone.
TABLE 1
From the above analysis it can be seen that partial feature-level fusion (of monocular and multi-view cues) better exploits the respective advantages of the single-frame and multi-frame methods, making the estimation of the dynamic foreground region optimal. However, its overall performance does not surpass depth estimation using multi-view cues alone. The inventors therefore believe that the monocular cues in feature-level fusion interfere with the multi-frame cues, so that feature-level fusion cannot fully exploit the advantage of geometric constraints. Moreover, feature-level fusion increases the number of parameters during model training, so the model performs best only on the training dataset and its generalization is weakened.
In order to solve the above technical problems, the embodiment of the present invention departs from the idea of feature-level fusion and provides a dynamic scene depth estimation method fusing multi-cue depth information that starts from the depth maps generated by each method. As shown in Fig. 1, it comprises the following steps:
s101: acquiring a sequence image containing a dynamic target;
s102: performing depth estimation on the sequence image containing the dynamic target by utilizing a multi-frame depth estimation algorithm, and acquiring a first depth image serving as a target image;
specifically, in consideration of the performance of each existing multi-frame depth estimation algorithm, the present embodiment selects the ACVNet algorithm as the multi-frame depth estimation algorithm. The ACVNet algorithm may be referred to as "Attention Concatenation Volume for Accurate and Efficient Stereo Matching".
In the embodiment of the invention, the process of performing depth estimation by using an ACVNet algorithm mainly comprises the following steps: feature extraction, cost body construction, cost aggregation and parallax regression;
feature extraction: features are extracted using a network structure that is ResNet-like. Firstly, downsampling an input image by using convolution operation on the first three layers of a network; then generating 1/4 resolution characteristic information by residual connection; finally, all the features under 1/4 resolution are connected to generate attention weight; and compressing the feature map through two-layer convolution operation to construct an initial connector.
Cost volume construction: geometric information is extracted from the correlation of the stereo pair to generate attention weights. To address mismatching in textureless regions, a more robust cost volume is constructed using multi-level adaptive patch matching. Three feature maps of different levels are obtained from the feature extraction module, with 64, 128, and 128 channels respectively. For pixels at each level, matching costs are computed with dilated convolutions of predefined size and adaptively learned weights; controlling the dilation rate ensures that the same number of pixels is used when computing the similarity of the pixel at the patch center. The similarity of two corresponding pixels is then a weighted sum of the correlations between corresponding pixels within the patch. The resulting cost volume is processed by an hourglass module, which consists of four 3D convolutions with batch normalization and ReLU layers and two directly stacked hourglass networks. Each hourglass network is an encoder-decoder structure: the encoder contains four 3D convolution layers, and the decoder consists of two 3D deconvolution layers.
Cost aggregation: the encoder in the cost aggregation module produces an output, Output0, and each hourglass decoder produces an output, denoted Output1 and Output2 respectively. For each output, two 3D convolution modules generate a single-channel 4D cost volume, which is then converted into a probability volume along the disparity dimension by a softmax function and upsampled.
Disparity regression: the final prediction is computed with a soft-argmin function to generate the first depth map D_multi.
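As a hedged illustration of this regression step (a minimal sketch, not ACVNet's exact implementation), the soft-argmin can be written as a softmax over the negated costs followed by an expectation over the disparity hypotheses:

```python
import numpy as np

def soft_argmin(cost_volume, disparities):
    """Soft-argmin disparity regression: convert a cost volume into a
    probability volume via a softmax over negated costs, then take the
    expected disparity. cost_volume: (D, H, W); lower cost = better match."""
    logits = -cost_volume
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)       # probability volume
    # expected disparity: sum_d d * P(d) per pixel
    return np.tensordot(disparities, prob, axes=(0, 0))

# toy example: 4 disparity hypotheses over a 2x2 image
disps = np.array([0.0, 1.0, 2.0, 3.0])
cost = np.full((4, 2, 2), 10.0)
cost[2, 0, 0] = 0.0   # strong match at disparity 2 for pixel (0, 0)
pred = soft_argmin(cost, disps)
```

A pixel with one clearly lowest cost regresses to that hypothesis, while a pixel with a flat cost profile regresses to the mean of the hypotheses.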
S103: performing depth estimation on the sequence image containing the dynamic target by using a depth estimation algorithm fusing monocular and multi-view cues, and acquiring a second depth image serving as a source image;
specifically, a DMdepth algorithm is selected as a depth estimation algorithm that fuses monocular and multiview cues. For details of the DMdepth algorithm reference is made to "Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes".
In the embodiment of the invention, given a sequence of images {I_{t-1}, I_t, I_{t+1}} with known camera intrinsics and pose information, where I_{t-1} and I_{t+1} are the frames adjacent to I_t, the goal is to estimate the depth of I_t and obtain a depth map D_t. The process of depth estimation using the DMdepth algorithm mainly includes:
the monocular cues are first constructed as a monocular depth volume. The simple U-Net architecture is used for predicting the monocular depth, and a pseudo monocular depth value is generated; then each absolute depth value is changed into a single thermal vector to obtain a single thermal depth body C mono 。
The multi-view cue is then built into a cost volume. Using the camera poses, camera intrinsics, and depth hypotheses uniformly sampled within the disparity range, the adjacent frames {I_{t-1}, I_{t+1}} are reconstructed to the target view. SSIM is then used to compute the pixel-level similarity between each reconstructed image and the target image, from which the multi-view cost volume C_multi is constructed.
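For reference, the SSIM similarity used here can be sketched as below. This is a hedged, global (whole-image) SSIM with the common default constants; cost-volume construction would apply a windowed, per-pixel variant of the same formula, and the cost conversion shown is a typical choice rather than DMdepth's documented one:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between two images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
img = rng.random((8, 8))                                   # target image patch
noisy = np.clip(img + 0.2 * rng.standard_normal((8, 8)), 0.0, 1.0)  # warped/reconstructed patch
# a common photometric matching cost derived from SSIM:
cost = (1.0 - ssim(img, noisy)) / 2.0
```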
The advantages of the two cues are then fused using a cross-cue attention fusion mechanism. First, C_multi and C_mono are downsampled through convolution layers to generate the multi-view and monocular depth features F_multi and F_mono. These features are fed into a cross-cue attention module, which extracts the relative internal relations between the two depth features; the monocular and multi-view relation features are then concatenated to generate the fused feature F_fused. Meanwhile, to retain the details of the original feature maps, a residual connection concatenates the original multi-view and monocular features into the depth concatenation feature F_cat. The final feature can be expressed as F = F_fused + F_cat. The fused feature F and the target frame image are input into the final depth estimation network to obtain the second depth map D_fusion.
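One step of cross-cue attention can be sketched as follows. This is a hedged single-head illustration under simplifying assumptions: the actual module would use learned query/key/value projections, and F_cat is a concatenation followed by a projection, which is stood in for here by a plain identity residual:

```python
import numpy as np

def cross_cue_attention(q_feat, kv_feat):
    """One head of cross-cue attention: features of one cue (queries) attend
    over features of the other cue (keys/values): softmax(Q K^T / sqrt(d)) V.
    Returns the attended output and the attention weights."""
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv_feat, w

rng = np.random.default_rng(0)
f_multi = rng.standard_normal((6, 8))   # 6 positions, 8-dim multi-view features
f_mono = rng.standard_normal((6, 8))    # monocular features at the same positions
f_fused, attn = cross_cue_attention(f_multi, f_mono)  # multi-view attends to monocular
f_final = f_fused + f_multi             # residual path standing in for F_cat
```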
S104: and fusing the first depth map and the second depth map to obtain a final depth map.
Specifically, the purpose of the fusion operation is to transfer the dynamic-foreground information of D_fusion into D_multi, preserving the superior static-scene depth estimate while fusing in high-precision dynamic foreground depth information.
In the embodiment of the invention, the multi-cue depth maps are fused based on the idea of Poisson fusion. Convolutional neural networks attend to different details when processing different types of data. Monocular depth estimation focuses on strongly textured information in the scene, so it produces a better dynamic foreground while neglecting complex static background (such as foliage). Multi-frame depth estimation performs all computation under geometric constraints, so the matching between images is stronger and the static-scene estimate is more faithful. Fusing multi-cue depth information is therefore a direct way to enhance the depth estimation. The goal of our approach is to find the optimal fusion operation between D_multi and D_fusion:
F = D_multi ⊕ D_fusion
Inspired by Poisson fusion, the fusion operation transfers the dynamic-foreground gradients of D_fusion into D_multi, preserving the superior static-scene depth estimate while fusing in high-precision dynamic foreground depth information.
In addition, to better determine the region to be fused, the dynamic target mask of the target image is first used to locate the dynamic target region in the source image, which serves as the region to be fused; then, according to the difference between the source image and the target image, and based on the idea of Poisson fusion, the gradient information of the region to be fused is computed and propagated according to the following formula to obtain the final depth map:

min_F ∬_Ω ‖∇F − ∇D_fusion‖² dp, subject to F|_∂Ω = D_multi|_∂Ω, with F = D_multi outside Ω;

where D_multi denotes the first depth map, D_fusion the second depth map, Ω the region to be fused in the fused target image I, ∇ the gradient operator, F the final depth map, and ⊕ the fusion operation. The meaning of the formula is that only gradient-domain information is considered inside Ω, while in the other regions only the value-domain (intensity) information of the image is considered.
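A minimal numeric sketch of this gradient-domain fusion is given below, assuming a simple Jacobi iteration on the discrete Poisson equation (all names and the toy depth maps are illustrative; a real implementation would use a faster solver or, as in Example 2, a learned network):

```python
import numpy as np

def gradient_fuse(target, source, mask, n_iter=4000):
    """Gradient-domain fusion: inside the mask, solve the discrete Poisson
    equation Laplacian(F) = Laplacian(source) by Jacobi iteration, holding
    F equal to the target outside the mask, so the source's gradients are
    propagated into the target across the mask boundary."""
    F = target.astype(float).copy()
    lap_s = (np.roll(source, 1, 0) + np.roll(source, -1, 0) +
             np.roll(source, 1, 1) + np.roll(source, -1, 1) - 4.0 * source)
    inner = mask.copy()
    inner[0, :] = inner[-1, :] = inner[:, 0] = inner[:, -1] = False
    for _ in range(n_iter):
        nb = (np.roll(F, 1, 0) + np.roll(F, -1, 0) +
              np.roll(F, 1, 1) + np.roll(F, -1, 1))
        F[inner] = (nb[inner] - lap_s[inner]) / 4.0   # Jacobi update
    return F

# toy depth maps: flat "target" background, curved "source" foreground
d_multi = np.zeros((12, 12))
d_fusion = np.add.outer(np.arange(12.0) ** 2, np.zeros(12))
omega = np.zeros((12, 12), dtype=bool)
omega[3:9, 3:9] = True                   # dynamic-region mask
fused = gradient_fuse(d_multi, d_fusion, omega)
```

Outside Ω the result equals the target exactly; inside Ω the result reproduces the source's Laplacian (here a constant 2) while seamlessly meeting the target at the mask boundary.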
The embodiment of the invention departs from the idea of feature-level fusion and starts from the depth maps generated by each method: the depth map generated by the method using only multi-frame cues is taken as the target image, preserving its better background information; the depth map from the method fusing monocular and multi-view cues is taken as the source image, preserving the optimal dynamic-foreground depth information. In general, the dynamic regions of outdoor scenes (such as cars and pedestrians) have prominent edge information: the pixel intensity of an image changes sharply at object edges, where the image gradient reaches its maximum. Inspired by Poisson fusion, the fusion region is determined from the mask information, so that the gradients of the foreground image are propagated into the background image along the mask region; multi-view geometric constraints and monocular cues are thus fully utilized, and the overall dynamic scene depth estimation is optimal.
Example 2
Building on the above embodiment, the difference is that, considering direct Poisson fusion is nontrivial, the embodiment of the present invention uses a multi-layer gradient fusion network to approximate Poisson fusion. The structure of the gradient fusion network is shown in Fig. 4: it comprises a gradient layer (grad), an encoder, a decoder and an output layer connected in sequence. The gradient layer extracts the gradient of the region to be fused. The encoder comprises 6 sequentially connected layers: the first layer comprises a convolution layer and a GroupNorm layer in sequence, and the second to fifth layers each comprise a LeakyReLU activation layer, a convolution layer and a GroupNorm layer in sequence. The decoder comprises 6 layers, the first to sixth each comprising a GroupNorm layer, an upsampling layer, a convolution layer and a LeakyReLU activation layer in sequence. The output layer comprises an upsampling layer, a convolution layer and a LeakyReLU activation layer in sequence; the output of each decoder layer is skip-connected as input to the output layer. The purpose of the gradient layer is to let the two types of cues share hyperparameters without degrading training performance.
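The fixed (non-learned) gradient layer at the front of this network can be sketched as a simple finite-difference operator; this is a hedged assumption about its behavior, since the patent does not specify the exact discretization:

```python
import numpy as np

def gradient_layer(region):
    """Sketch of a non-learned gradient layer: forward differences along x
    and y, stacked as a 2-channel map to feed the encoder. The last
    row/column is padded by replication, so its difference is zero."""
    gx = np.diff(region, axis=1, append=region[:, -1:])
    gy = np.diff(region, axis=0, append=region[-1:, :])
    return np.stack([gx, gy])

ramp = np.add.outer(np.arange(4.0), np.zeros(5))  # depth increases by 1 per row
g = gradient_layer(ramp)
```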
Specifically, a dynamic target mask in a target image is utilized to position a dynamic target area in a source image and serve as an area to be fused; and inputting the first depth map and the region to be fused into a gradient fusion network to obtain a final depth map.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (5)
1. The dynamic scene depth estimation method integrating multi-cue depth information is characterized by comprising the following steps of:
step 1: acquiring a sequence image containing a dynamic target;
step 2: performing depth estimation on the sequence image containing the dynamic target by utilizing a multi-frame depth estimation algorithm, and acquiring a first depth image serving as a target image;
step 3: performing depth estimation on the sequence image containing the dynamic target by using a depth estimation algorithm fusing monocular and multi-view cues, and acquiring a second depth image serving as a source image;
step 4: and fusing the first depth map and the second depth map to obtain a final depth map.
2. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 2, an ACVNet algorithm is selected as the multi-frame depth estimation algorithm.
3. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 2, DMdepth algorithm is selected as the depth estimation algorithm for fusion of monocular and multi-view cues.
4. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 4, specifically comprising:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
according to the difference between the source image and the target image, calculating and transmitting gradient information of the region to be fused according to the following formula to obtain a final depth map:

min_F ∬_Ω ‖∇F − ∇D_fusion‖² dp, subject to F|_∂Ω = D_multi|_∂Ω, with F = D_multi outside Ω;

wherein D_multi represents the first depth map, D_fusion represents the second depth map, Ω represents the region to be fused in the fused target image I, ∇ represents the gradient operator, F represents the final depth map, and ⊕ represents the fusion operation.
5. The method for dynamic scene depth estimation with multi-cue depth information fusion according to claim 1, wherein in step 4, specifically comprising:
positioning a dynamic target area in the source image by utilizing a dynamic target mask in the target image and taking the dynamic target area as an area to be fused;
inputting the first depth map and the region to be fused into a gradient fusion network to obtain a final depth map; the gradient fusion network comprises a gradient layer, an encoder, a decoder and an output layer connected in sequence; the gradient layer is used for extracting the gradient of the region to be fused; the encoder comprises 6 sequentially connected layers, wherein the first layer comprises a convolution layer and a GroupNorm layer connected in sequence, and the second to fifth layers each comprise a LeakyReLU activation layer, a convolution layer and a GroupNorm layer connected in sequence; the decoder comprises 6 layers, the first to sixth layers each comprising a GroupNorm layer, an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output layer comprises an upsampling layer, a convolution layer and a LeakyReLU activation layer connected in sequence; the output of each decoder layer is skip-connected as input to the output layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311744712.XA CN117830369A (en) | 2023-12-18 | 2023-12-18 | Dynamic scene depth estimation method integrating multi-cue depth information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117830369A true CN117830369A (en) | 2024-04-05 |
Family
ID=90507048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311744712.XA Pending CN117830369A (en) | 2023-12-18 | 2023-12-18 | Dynamic scene depth estimation method integrating multi-cue depth information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117830369A (en) |
- 2023-12-18: application CN202311744712.XA filed in China (CN); published as CN117830369A; status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
US11954813B2 (en) | Three-dimensional scene constructing method, apparatus and system, and storage medium | |
Liu et al. | Continuous depth estimation for multi-view stereo | |
WO2018127007A1 (en) | Depth image acquisition method and system | |
CN108062769B (en) | Rapid depth recovery method for three-dimensional reconstruction | |
CN108734776A (en) | A kind of three-dimensional facial reconstruction method and equipment based on speckle | |
CN113077505B (en) | Monocular depth estimation network optimization method based on contrast learning | |
CN115035235A (en) | Three-dimensional reconstruction method and device | |
CN112288788A (en) | Monocular image depth estimation method | |
CN109644280B (en) | Method for generating hierarchical depth data of scene | |
CN113436254B (en) | Cascade decoupling pose estimation method | |
CN112270701B (en) | Parallax prediction method, system and storage medium based on packet distance network | |
CN111652922B (en) | Binocular vision-based monocular video depth estimation method | |
CN113269823A (en) | Depth data acquisition method and device, storage medium and electronic equipment | |
Zhou et al. | Single-view view synthesis with self-rectified pseudo-stereo | |
CN112927348A (en) | High-resolution human body three-dimensional reconstruction method based on multi-viewpoint RGBD camera | |
CN115965961B (en) | Local-global multi-mode fusion method, system, equipment and storage medium | |
CN112489097A (en) | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution | |
Kao | Stereoscopic image generation with depth image based rendering | |
CN117830369A (en) | Dynamic scene depth estimation method integrating multi-cue depth information | |
CN112907645B (en) | Disparity map acquisition method, disparity map acquisition device, disparity map training method, electronic device, and medium | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
CN115908992A (en) | Binocular stereo matching method, device, equipment and storage medium | |
KR100655465B1 (en) | Method for real-time intermediate scene interpolation | |
Pan et al. | An automatic 2D to 3D video conversion approach based on RGB-D images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||